INTRODUCTION

The problem of discrimination or classification arises when an investigator is given an item, I, which is known to have come from one of k specified categories or populations, and is asked to classify this item into the population from which it came. The classification problem becomes statistical when we further specify that the available evidence about I consists of observed values of a set of random variables. The basis for classification depends upon these observed random variables as well as the information available about the k populations. In most practical situations it may be assumed that there is a finite number of populations from which the individual may have come, and that each population is characterized by a probability distribution. Thus the item I is a random observation from one of k specified probability distributions. In some problems the probability distributions for all of the k populations are completely known. In other problems the probability distributions may be known or assumed to be of a specified type, but only sample estimates of their parameters are available. In all cases let the observed random variables from the item I be a set of measurements of, say, p characteristics or quantities, taken from I.

Suppose an item I is known to have come from one of k populations $\pi_1, \ldots, \pi_k$. Denote the vector of p measurements taken from the item as $V' = (x_1, \ldots, x_p)$. Let R denote the total sample space, which consists of all possible vector measurements; thus R is a p-dimensional space. The purpose of the discriminant function is to divide the R space into k mutually exclusive subspaces $R_1, \ldots, R_k$ such that if an observation falls in the region $R_j$, it is classified as coming from population $\pi_j$, $j = 1, \ldots, k$. The criterion for determining the regions $R_1, \ldots, R_k$ will be that of minimizing either the probability of misclassification or the cost of misclassification.

The history of discriminatory analysis may be regarded as beginning with the work of Karl Pearson in 1920. Pearson was faced with the problem of measuring the distance between two multivariate populations. He proposed a statistic which he called the "coefficient of racial likeness" and denoted it as $C^2$. His first work on $C^2$ assumed that the p measured characteristics were independent. Pearson later made an adjustment in the "coefficient of racial likeness" to account for the relationship between the p variates.

In 1925 P. C. Mahalanobis became interested in the subject and proposed an alternative measure which he called $D^2$. In 1931 Hotelling generalized "Student's" t into a statistic which he called $T^2$. Hotelling's $T^2$ and Mahalanobis' $D^2$ are in fact equivalent; however, it was some time before this equivalence was realized. In 1936 Fisher published his first paper on discriminant functions. The main difference between his approach and that of Mahalanobis was that the latter was measuring distance between groups, whereas Fisher was concerned with dividing the sample space into two regions and classifying a sample value to one population or another on the basis of which region it fell into. Then in 1939 Welch linked the theory of discriminatory functions with that of statistical tests. It is at this point, where one is interested in directions between groups and regions of classification and not just the distance between groups, that this report begins.


CLASSIFICATION  INTO  ONE  OF  TWO  POPULATIONS  WITH 
KNOWN  PROBABILITY  DISTRIBUTIONS 


A  Priori  Probabilities  Are  Known 

Consider a situation in which the item I is known to belong to one of two populations, $\pi_1$ or $\pi_2$. Using any given classification procedure, the investigator could make two kinds of errors in classification. If the item belonged to $\pi_1$, it could be classified as coming from $\pi_2$, an error whose probability is denoted by P(2/1); and if the item belonged to $\pi_2$, it could be classified as coming from $\pi_1$, with probability denoted by P(1/2). Let the a priori probability of selecting an observation from population $\pi_i$ be represented by $p_i$. Also let the density function of population $\pi_i$ be represented by $f_i(v; \theta_i)$, where

$$f_i(v; \theta_i) = f_i(x_1, \ldots, x_p; \theta_{i1}, \ldots, \theta_{ip}).$$

If R, the total sample space, is subdivided into two mutually exclusive subspaces, $R_1$ and $R_2$, such that all observations in $R_1$ are classified as belonging to $\pi_1$, and all observations in $R_2$ are classified as belonging to $\pi_2$, then the probability of correctly classifying an observation is

$$p_1 \int_{R_1} f_1(v; \theta_1)\, dv + p_2 \int_{R_2} f_2(v; \theta_2)\, dv,$$

where $dv = dx_1 \cdots dx_p$. The probability of misclassifying an observation is then

$$M = p_1 \int_{R_2} f_1(v; \theta_1)\, dv + p_2 \int_{R_1} f_2(v; \theta_2)\, dv. \tag{1}$$

Using the above procedure for classification, the problem becomes that of choosing $R_1$ and $R_2$ so as to minimize M.

Using the method of Bayes, the a posteriori probabilities that I belongs to $\pi_1$ or $\pi_2$ may be computed. That is, the conditional probability that an observation came from a certain population, given the observed values of the item's p measurements, may be computed. For instance, given the observed values for the p variates of an item, the conditional probabilities that the item belongs to population $\pi_1$ or to population $\pi_2$ are

$$\frac{p_1 f_1(v; \theta_1)}{p_1 f_1(v; \theta_1) + p_2 f_2(v; \theta_2)} \quad \text{and} \quad \frac{p_2 f_2(v; \theta_2)}{p_1 f_1(v; \theta_1) + p_2 f_2(v; \theta_2)},$$

respectively.

For a given observation I, the probability of misclassification is minimized by assigning I to that population which has the largest conditional probability. That is, if

$$\frac{p_1 f_1(v; \theta_1)}{p_1 f_1(v; \theta_1) + p_2 f_2(v; \theta_2)} > \frac{p_2 f_2(v; \theta_2)}{p_1 f_1(v; \theta_1) + p_2 f_2(v; \theta_2)},$$

then I is classified as coming from population $\pi_1$. If the direction of the inequality is reversed and the statement holds, I is classified as coming from population $\pi_2$. When neither inequality holds, the populations are equally probable and it makes no difference which one is chosen. To make the regions $R_1$ and $R_2$ mutually exclusive one can arbitrarily assign I to $R_1$ when

$$p_1 f_1(v; \theta_1) = p_2 f_2(v; \theta_2).$$

Thus the subspaces $R_1$ and $R_2$ are

$$R_1 : \frac{f_1(v; \theta_1)}{f_2(v; \theta_2)} \ge \frac{p_2}{p_1}, \tag{2}$$

and

$$R_2 : \frac{f_1(v; \theta_1)}{f_2(v; \theta_2)} < \frac{p_2}{p_1}, \tag{3}$$

respectively, where the symbol ":" stands for "defined by."

It should be noted that since the probability of misclassification is minimized at each point, i.e., for all I, it is minimized for the entire space R.
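The assignment rule of (2) and (3) can be sketched in a few lines of Python. This is only an illustration under assumed inputs: the two univariate normal populations, the a priori probabilities 0.7 and 0.3, and the function names are hypothetical and do not come from the report.

```python
from statistics import NormalDist


def posterior_probabilities(f1, f2, p1, p2, v):
    """Return the a posteriori probabilities of pi_1 and pi_2 given v."""
    w1 = p1 * f1(v)
    w2 = p2 * f2(v)
    total = w1 + w2
    return w1 / total, w2 / total


def bayes_classify(f1, f2, p1, p2, v):
    """Return 1 when v lies in R_1 (f1/f2 >= p2/p1), otherwise 2."""
    # Written as p1*f1 >= p2*f2 so that a zero density causes no division.
    # Ties are assigned to R_1, as in the text.
    return 1 if p1 * f1(v) >= p2 * f2(v) else 2


# Hypothetical populations: pi_1 = N(0, 1) and pi_2 = N(2, 1).
f1 = NormalDist(mu=0.0, sigma=1.0).pdf
f2 = NormalDist(mu=2.0, sigma=1.0).pdf

post1, post2 = posterior_probabilities(f1, f2, 0.7, 0.3, 1.0)
label = bayes_classify(f1, f2, 0.7, 0.3, 1.0)
```

At v = 1 the two assumed densities are equal, so the a posteriori probabilities reduce to the a priori probabilities and the item is assigned to $\pi_1$.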

Now the question arises: Is this the "best" procedure? The best procedure is that one which minimizes the probability of misclassification. For any procedure, the probability of misclassification is given by equation (1), where the intersection of $R_1$ and $R_2$ is the null set and the union of $R_1$ and $R_2$ is the entire R space.

Equation (1) can be expressed as

$$M = \int_{R_2} \left[ p_1 f_1(v; \theta_1) - p_2 f_2(v; \theta_2) \right] dv + \int_{R} p_2 f_2(v; \theta_2)\, dv. \tag{4}$$

The second term on the right hand side of equation (4) is a constant, namely $p_2$. Therefore M will be a minimum when $R_2$ includes all the points V such that $p_1 f_1(v; \theta_1) - p_2 f_2(v; \theta_2) < 0$, and excludes all the points V for which $p_1 f_1(v; \theta_1) - p_2 f_2(v; \theta_2) > 0$. Thus the regions $R_1$ and $R_2$ are those defined by equations (2) and (3).

If it is assumed that

$$\Pr\left\{ \frac{f_1(v; \theta_1)}{f_2(v; \theta_2)} = \frac{p_2}{p_1} \;\middle|\; \pi_i \right\} = 0, \qquad i = 1, 2,$$

then the Bayes procedure is unique.

If the expected cost of misclassification, which is given by

$$C(2/1)\, p_1 \int_{R_2} f_1(v; \theta_1)\, dv + C(1/2)\, p_2 \int_{R_1} f_2(v; \theta_2)\, dv,$$

is to be minimized, where C(2/1) is the cost of classifying an observation into $\pi_2$ when it actually comes from $\pi_1$, and C(1/2) is the cost of classifying an observation into $\pi_1$ when it actually comes from $\pi_2$, then our regions become:

$$R_1 : \frac{f_1(v; \theta_1)}{f_2(v; \theta_2)} \ge \frac{p_2\, C(1/2)}{p_1\, C(2/1)},$$

and

$$R_2 : \frac{f_1(v; \theta_1)}{f_2(v; \theta_2)} < \frac{p_2\, C(1/2)}{p_1\, C(2/1)}.$$
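A small numerical check may help make the cost-weighted regions concrete. The sketch below assumes two univariate normal populations with a common standard deviation and with the mean of $\pi_1$ below that of $\pi_2$, so that $R_1$ is the half-line to the left of a single cut point. The populations, probabilities, costs, and function names are all hypothetical.

```python
from math import log
from statistics import NormalDist


def expected_cost(cut, pop1, pop2, p1, p2, c21, c12):
    """Expected cost when R_1 = {x <= cut} and R_2 = {x > cut}.

    c21 stands for C(2/1) and c12 stands for C(1/2).
    """
    prob_2_given_1 = 1.0 - pop1.cdf(cut)  # P(2/1): a pi_1 item falls in R_2
    prob_1_given_2 = pop2.cdf(cut)        # P(1/2): a pi_2 item falls in R_1
    return c21 * p1 * prob_2_given_1 + c12 * p2 * prob_1_given_2


def bayes_cut(mu1, mu2, sigma, p1, p2, c21, c12):
    """Cut point where f1/f2 = p2*C(1/2) / (p1*C(2/1)), assuming mu1 < mu2."""
    k = (p2 * c12) / (p1 * c21)
    # ln(f1/f2) = (mu1 - mu2) x / sigma^2 - (mu1^2 - mu2^2) / (2 sigma^2)
    return (sigma ** 2 * log(k) + (mu1 ** 2 - mu2 ** 2) / 2.0) / (mu1 - mu2)


# Hypothetical setting: pi_1 = N(0, 1), pi_2 = N(2, 1), p1 = 0.7, p2 = 0.3,
# C(2/1) = 1 and C(1/2) = 5.
pop1, pop2 = NormalDist(0.0, 1.0), NormalDist(2.0, 1.0)
cut = bayes_cut(0.0, 2.0, 1.0, 0.7, 0.3, 1.0, 5.0)
cost_at_cut = expected_cost(cut, pop1, pop2, 0.7, 0.3, 1.0, 5.0)
```

Moving the cut point in either direction away from `cut` increases the expected cost in this setting, which is the one-dimensional counterpart of the argument made for equation (4).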


Classification  Into  One  of  Two  Known  Multivariate  Normal  Populations 


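Now let us apply the general procedure outlined above to the case in which the two populations are multivariate normal populations with equal covariance matrices. Let $N(\mu^{(1)}, \Sigma)$ and $N(\mu^{(2)}, \Sigma)$ represent the two populations, where $\mu^{(k)\prime} = (\mu_{k1}, \ldots, \mu_{kp})$ is the vector of means for the kth population, and $\Sigma$ is the common covariance matrix with elements $\sigma_{ij}$, $i, j = 1, 2, \ldots, p$. Then the density function of the kth population (k = 1, 2) is:

$$f_k(v; \theta_k) = \frac{|\Sigma^{-1}|^{1/2}}{(2\pi)^{p/2}} \exp\left[ -\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (x_i - \mu_{ki})(x_j - \mu_{kj}) \right], \tag{5}$$

where the $\sigma^{ij}$ are the elements of the inverse matrix $\Sigma^{-1}$, and $|\Sigma^{-1}|$ is the determinant of the inverse matrix $\Sigma^{-1}$. The ratio of the two density functions is:

$$\frac{f_1(v; \theta_1)}{f_2(v; \theta_2)} = \exp\left[ -\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} \left( x_i x_j - x_i \mu_{1j} - x_j \mu_{1i} + \mu_{1i}\mu_{1j} - x_i x_j + x_i \mu_{2j} + x_j \mu_{2i} - \mu_{2i}\mu_{2j} \right) \right],$$

or equivalently,

$$\frac{f_1(v; \theta_1)}{f_2(v; \theta_2)} = \exp\left[ -\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (\mu_{1i}\mu_{1j} - \mu_{2i}\mu_{2j}) + \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (\mu_{1j} - \mu_{2j})\, x_i \right]. \tag{6}$$

The region $R_1$ was defined as the set of V for which (2) holds. Since the logarithmic function is monotonic increasing, the inequality, and hence the region $R_1$, can be written in terms of the logarithm of (2). Taking the logarithms of these inequalities we have:

$$R_1 : \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (\mu_{1j} - \mu_{2j})\, x_i - \frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (\mu_{1i}\mu_{1j} - \mu_{2i}\mu_{2j}) \ge \ln k, \tag{7}$$

and

$$R_2 : \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (\mu_{1j} - \mu_{2j})\, x_i - \frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (\mu_{1i}\mu_{1j} - \mu_{2i}\mu_{2j}) < \ln k, \tag{8}$$

for k suitably chosen. If the probability of misclassification is to be minimized and the a priori probabilities are known, then $k = p_2 / p_1$. If the cost of misclassification, or the ratio of the costs of misclassification, is known and the expected cost is to be minimized rather than just the probability of misclassification, then k becomes $p_2\, C(1/2) / p_1\, C(2/1)$. For the particular case of the two populations being equally likely to occur, i.e., $p_1 = p_2$, and the cost of misclassification being the same for each population, $R_1$ becomes

$$\sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (\mu_{1j} - \mu_{2j})\, x_i \ge \frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (\mu_{1i}\mu_{1j} - \mu_{2i}\mu_{2j}). \tag{9}$$

It should be noted that if the covariance matrices are not equal then the region $R_1$ is defined by the quadratic expression:

$$-\frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma_1^{ij} (x_i - \mu_{1i})(x_j - \mu_{1j}) + \frac{1}{2} \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma_2^{ij} (x_i - \mu_{2i})(x_j - \mu_{2j}) \ge \ln k, \tag{10}$$

where $(\sigma_1^{ij})$ and $(\sigma_2^{ij})$ are the inverses of the covariance matrices of the two populations. Note that when the covariance matrices differ, the logarithm of the density ratio also contains the constant $\frac{1}{2} \ln(|\Sigma_2| / |\Sigma_1|)$, which must be absorbed into $\ln k$ for (10) to hold as written.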

No  A  Priori  Probabilities  Are  Known 
Using matrix notation one can express the region $R_1$ as:

$$R_1 : V' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) - \frac{1}{2} (\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) \ge \ln k,$$

where

$$(\mu^{(1)} - \mu^{(2)})' = (\mu_{11} - \mu_{21},\ \mu_{12} - \mu_{22},\ \ldots,\ \mu_{1p} - \mu_{2p})$$

is the vector of differences between the population means for the p characteristics, and all other terms are as defined earlier.
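The matrix form of the region $R_1$ translates directly into a short computation. The sketch below uses only the Python standard library; the small Gaussian-elimination helper `solve`, the other function names, and the numerical values of the means and covariance matrix are assumptions made for illustration, and no safeguard against a singular covariance matrix is included.

```python
from math import log


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def discriminant(v, mu1, mu2, sigma):
    """v' S^-1 (mu1 - mu2) - 1/2 (mu1 + mu2)' S^-1 (mu1 - mu2), S = sigma."""
    diff = [a - b for a, b in zip(mu1, mu2)]
    coef = solve(sigma, diff)  # Sigma^-1 (mu1 - mu2)
    mid = [(a + b) / 2.0 for a, b in zip(mu1, mu2)]
    return dot(v, coef) - dot(mid, coef)


def classify(v, mu1, mu2, sigma, k=1.0):
    """Return 1 if v falls in R_1 (discriminant >= ln k), otherwise 2."""
    return 1 if discriminant(v, mu1, mu2, sigma) >= log(k) else 2


# Hypothetical parameters for p = 2 characteristics.
mu1, mu2 = [1.0, 2.0], [3.0, 1.0]
sigma = [[2.0, 0.5], [0.5, 1.0]]
```

With these assumed values, the discriminant is zero at the midpoint of the two mean vectors, positive at $\mu^{(1)}$, and negative at $\mu^{(2)}$, so with k = 1 each mean vector is assigned to its own population.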

Given the a priori probabilities, the method of Bayes was used to determine k. When a priori probabilities do not exist or are unknown, a procedure must be sought for determining k. Anderson (1958) proves a series of theorems which enable him to state that the Bayes procedure $R^*$, for which P(1/2) = P(2/1), is a minimax procedure. A procedure is called minimax if the maximum expected loss is a minimum. Hence k is determined so that the expected losses due to misclassification are equal.

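Consider the distribution of the natural logarithm of the ratio of the density functions and denote this logarithm by U:

$$U = \ln \frac{f_1(v; \theta_1)}{f_2(v; \theta_2)} = V' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) - \frac{1}{2} (\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}).$$

If V is distributed according to $N(\mu^{(1)}, \Sigma)$ then

$$E(U) = \mu^{(1)\prime} \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) - \frac{1}{2} (\mu^{(1)} + \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}),$$

or equivalently,

$$E(U) = \frac{1}{2} (\mu^{(1)} - \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}).$$

Denote the variance of U by $\sigma^2(U)$. Since the second term of U is a constant,

$$\sigma^2(U) = \sigma^2(L), \quad \text{where} \quad L = V' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) = V' C,$$

where

$$C = \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}),$$

i.e.,

$$c_i = \sigma^{i1} (\mu_{11} - \mu_{21}) + \sigma^{i2} (\mu_{12} - \mu_{22}) + \cdots + \sigma^{ip} (\mu_{1p} - \mu_{2p}), \qquad (i = 1, \ldots, p).$$

Note that L can be expressed as $\sum_{i=1}^{p} c_i x_i$. Wilks (1962) states that

"If $(x_1, \ldots, x_p)$ has the p-variate distribution

$$N(\{\mu_i\};\ \|\sigma_{ij}\|), \qquad i, j = 1, \ldots, p,$$

then

$$L = c_1 x_1 + \cdots + c_p x_p$$

has the distribution

$$N\left( \sum_{i=1}^{p} c_i \mu_i ;\ \sum_{i=1}^{p} \sum_{j=1}^{p} c_i c_j \sigma_{ij} \right)\text{"}$$

hence

$$\sigma^2(L) = \sum_{i=1}^{p} \sum_{j=1}^{p} c_i c_j \sigma_{ij} = \sigma^2(U).$$

Therefore, using matrix notation one has

$$\sigma^2(U) = C' \Sigma C,$$

$$\sigma^2(U) = \left[ \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) \right]' \left[ \Sigma \right] \left[ \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}) \right],$$

$$\sigma^2(U) = (\mu^{(1)} - \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)}).$$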

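Hence, if V comes from $\pi_1$, U is distributed as $N(\tfrac{1}{2}a;\ a)$, where $a = (\mu^{(1)} - \mu^{(2)})' \Sigma^{-1} (\mu^{(1)} - \mu^{(2)})$. Similarly, if V comes from $\pi_2$, U is distributed as $N(-\tfrac{1}{2}a;\ a)$. Thus the probabilities of misclassification, given that the observation comes from $\pi_1$ or from $\pi_2$, are

$$P(2/1) = \frac{1}{\sqrt{2\pi a}} \int_{-\infty}^{d} e^{-\frac{1}{2a}\left(U - \frac{1}{2}a\right)^2}\, dU,$$

and

$$P(1/2) = \frac{1}{\sqrt{2\pi a}} \int_{d}^{\infty} e^{-\frac{1}{2a}\left(U + \frac{1}{2}a\right)^2}\, dU,$$

respectively, where $d = \ln k$.

Setting P(2/1) equal to P(1/2) and transforming the U variate to the standard normal scale we have

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{(d - a/2)/\sqrt{a}} e^{-\frac{1}{2} z^2}\, dz = \frac{1}{\sqrt{2\pi}} \int_{(d + a/2)/\sqrt{a}}^{\infty} e^{-\frac{1}{2} z^2}\, dz.$$

In order for this equation to hold, d must equal zero.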

Hence, when no a priori probabilities are known the problem is treated in the same manner as if a priori probabilities are known and are equal. Thus, for this case the regions $R_1$ and $R_2$ are defined by equations (7) and (8) respectively, where k = 1.

When $C(1/2) \ne C(2/1)$ and the cost of misclassification is to be a minimum, then d can be determined by use of the normal tables and trial and error such that C(2/1) P(2/1) = C(1/2) P(1/2).
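The trial-and-error search with normal tables can be imitated numerically. The following sketch evaluates P(2/1) and P(1/2) from the two normal distributions of U given above and then finds d by bisection, which stands in here for the table lookup. The function names, the starting bracket, and the tolerance are assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal distribution function


def error_probabilities(d, a):
    """Return (P(2/1), P(1/2)) for the rule 'assign to pi_1 when U >= d'.

    Under pi_1, U is N(a/2, a); under pi_2, U is N(-a/2, a).
    """
    root = sqrt(a)
    p21 = PHI((d - a / 2.0) / root)        # a pi_1 item with U < d
    p12 = 1.0 - PHI((d + a / 2.0) / root)  # a pi_2 item with U >= d
    return p21, p12


def minimax_cut(a, c21=1.0, c12=1.0, tol=1e-12):
    """Find d such that C(2/1) P(2/1) = C(1/2) P(1/2) by bisection."""

    def gap(d):
        p21, p12 = error_probabilities(d, a)
        return c21 * p21 - c12 * p12  # increasing in d

    lo, hi = -1.0, 1.0
    while gap(lo) > 0:  # widen the bracket until it contains the root
        lo *= 2.0
    while gap(hi) < 0:
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

For example, `minimax_cut(4.0)` returns a value that is zero to within the tolerance, in agreement with the result d = 0 for equal costs, while `minimax_cut(4.0, c21=1.0, c12=5.0)` returns a positive d, which shrinks $R_1$ when assigning a $\pi_2$ item to $\pi_1$ is the costlier error.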

Unknown  Parameters 

So far we have assumed that the distributions of the two populations are completely known. In most practical situations the two populations are not completely known. However, they are known or assumed to be of a specified type and their parameters must be estimated from two samples, one from each population. The first step in this problem is that of testing the hypothesis that the two samples actually do come from different populations, for classification would be meaningless unless the two populations are distinguishable. To test this hypothesis we will employ the $T^2$ statistic, which is a direct generalization of Student's t, derived by Hotelling (1931). The $T^2$ statistic may be used to test the hypothesis that a multivariate sample came from a specified normal population or that two independent multivariate samples have been drawn from the same normal population. In the two-sample problem, the normal populations must have the same but unknown covariance matrices.

Let $x_{kji}$ denote the value of the jth variate measured on the ith individual from the kth sample, where k = 1, 2; $i = 1, 2, \ldots, n_k$; and $j = 1, 2, \ldots, p$. Let $\bar{x}_{1j}$ and $\bar{x}_{2j}$ be the arithmetic means of the values of the jth variates in the first and second samples, respectively, where

$$\bar{x}_{kj} = \sum_{i=1}^{n_k} \frac{x_{kji}}{n_k}.$$

Next define

$$d_j = \bar{x}_{1j} - \bar{x}_{2j},$$

$$n = n_1 + n_2 - 2,$$

and

$$n\, s_{jj'} = \sum_{i=1}^{n_1} (x_{1ji} - \bar{x}_{1j})(x_{1j'i} - \bar{x}_{1j'}) + \sum_{i=1}^{n_2} (x_{2ji} - \bar{x}_{2j})(x_{2j'i} - \bar{x}_{2j'}).$$

Now form the estimate of our covariance matrix

$$S = \begin{pmatrix} s_{11} & s_{12} & \cdots & s_{1p} \\ s_{21} & s_{22} & \cdots & s_{2p} \\ \vdots & \vdots & & \vdots \\ s_{p1} & s_{p2} & \cdots & s_{pp} \end{pmatrix}.$$

Hotelling's $T^2$ statistic is

$$T^2 = \frac{n_1 n_2}{n_1 + n_2} \sum_{j=1}^{p} \sum_{i=1}^{p} s^{ij} d_i d_j = \frac{n_1 n_2}{n_1 + n_2}\, D' S^{-1} D,$$

where $s^{ij}$ is the element in the ij position of the inverse matrix of S and $D' = (\bar{x}_{11} - \bar{x}_{21}, \ldots, \bar{x}_{1p} - \bar{x}_{2p})$. Hotelling proved that, when the two populations have the same mean vector, the quantity

$$\frac{n + 1 - p}{n\, p}\, T^2$$

is distributed as the F-distribution with p and n + 1 - p degrees of freedom. That is,

$$\frac{n_1 n_2 (n_1 + n_2 - p - 1)}{(n_1 + n_2)\, p\, (n_1 + n_2 - 2)} \sum_{j=1}^{p} \sum_{i=1}^{p} s^{ij} d_i d_j$$

has the F-distribution with p and $n_1 + n_2 - p - 1$ degrees of freedom.
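The computation of $T^2$ and the associated F ratio can be sketched as follows, using only the Python standard library. The helper names and the two small bivariate samples are hypothetical, and the linear solver is a plain Gaussian elimination with no safeguard against a singular S.

```python
def mean_vector(sample):
    n = len(sample)
    return [sum(row[j] for row in sample) / n for j in range(len(sample[0]))]


def pooled_covariance(sample1, sample2):
    """Matrix S with elements s_jj', using the divisor n = n1 + n2 - 2."""
    p = len(sample1[0])
    s = [[0.0] * p for _ in range(p)]
    for sample in (sample1, sample2):
        m = mean_vector(sample)
        for row in sample:
            for j in range(p):
                for jp in range(p):
                    s[j][jp] += (row[j] - m[j]) * (row[jp] - m[jp])
    n = len(sample1) + len(sample2) - 2
    return [[value / n for value in r] for r in s]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def hotelling_t2(sample1, sample2):
    """Return (T^2, F, (df1, df2)) for the two-sample problem."""
    n1, n2, p = len(sample1), len(sample2), len(sample1[0])
    m1, m2 = mean_vector(sample1), mean_vector(sample2)
    d = [a - b for a, b in zip(m1, m2)]
    s_inv_d = solve(pooled_covariance(sample1, sample2), d)  # S^-1 D
    t2 = n1 * n2 / (n1 + n2) * sum(di * wi for di, wi in zip(d, s_inv_d))
    n = n1 + n2 - 2
    f = (n + 1 - p) / (n * p) * t2
    return t2, f, (p, n + 1 - p)


# Hypothetical samples with p = 2 variates and n1 = n2 = 4 individuals.
sample1 = [[5.1, 3.5], [4.9, 3.0], [4.7, 3.2], [5.0, 3.6]]
sample2 = [[7.0, 3.2], [6.4, 3.2], [6.9, 3.1], [5.5, 2.3]]
t2, f, df = hotelling_t2(sample1, sample2)
```

The returned F value would then be compared with the tabulated F-distribution for the degrees of freedom in `df`. For p = 1 the statistic reduces to the square of the usual two-sample Student t with pooled variance.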

Now the critical region may be selected from the tables of the F-distribution at whatever level of significance is desired. Once we have accepted the hypothesis that the two samples are from different populations, let us turn to the problem of classifying V into the population from which it came. That is, we have a sample $V_1^{(1)}, \ldots, V_{n_1}^{(1)}$ from $N(\mu^{(1)}, \Sigma)$ and a sample $V_1^{(2)}, \ldots, V_{n_2}^{(2)}$ from $N(\mu^{(2)}, \Sigma)$ and, on the basis of this information and the measurements taken on V, we want to classify V as coming from $\pi_1$ or $\pi_2$. Our maximum likelihood estimate of $\mu^{(1)\prime}$ is $\bar{X}^{(1)\prime} = (\bar{x}_{11}, \ldots, \bar{x}_{1p})$, of $\mu^{(2)\prime}$ is $\bar{X}^{(2)\prime} = (\bar{x}_{21}, \ldots, \bar{x}_{2p})$, and of $\sigma_{ij}$ is $s_{ij}$, where $\bar{x}_{kj}$ and $s_{ij}$ are as defined above.

Substituting these maximum likelihood estimates for the parameters, we use the same criterion for the classification of V as we did in the situations in which the parameters were known.

Hence for the case where $p_1 = p_2$,

$$R_1 : V' S^{-1} (\bar{X}^{(1)} - \bar{X}^{(2)}) \ge \frac{1}{2} (\bar{X}^{(1)} + \bar{X}^{(2)})' S^{-1} (\bar{X}^{(1)} - \bar{X}^{(2)}), \tag{11}$$

$$R_2 : V' S^{-1} (\bar{X}^{(1)} - \bar{X}^{(2)}) < \frac{1}{2} (\bar{X}^{(1)} + \bar{X}^{(2)})' S^{-1} (\bar{X}^{(1)} - \bar{X}^{(2)}). \tag{12}$$


R.  A.  FISHER'S  APPROACH  TO  THE  CLASSIFICATION  PROBLEM 

In 1936 R. A. Fisher considered the problem of discrimination in a totally different manner and obtained similar results. Fisher's approach was as follows:

There are two populations, $\pi_1$ and $\pi_2$. From each population there is a sample, $n_1$ items from $\pi_1$ and $n_2$ items from $\pi_2$. Also, there is an item, I, which could have come from either $\pi_1$ or $\pi_2$. The decision problem is then to assign I to one of the two populations, where the available information consists of measurements of p quantities which are made on I and the $n_1 + n_2$ sample items.

If there is only one characteristic, then the problem of classification is very simple: all individuals having values of the characteristic exceeding a suitably determined value could be assigned to one group, and all others to the other group.

Fisher dealt with the multivariate problem, i.e., p > 1, by reducing it to a univariate problem. To do this he replaced the p measurements for each individual by a single measurement, say Y. Fisher considered only linear combinations of the p variates. Therefore one has

$$Y_{ki} = z_1 x_{k1i} + z_2 x_{k2i} + \cdots + z_p x_{kpi}$$

as the linear combination of the p measurements representing the ith individual from the kth sample. If one denotes the measurement of the jth trait on item I by $x_j$, then

$$Y_I = z_1 x_1 + z_2 x_2 + \cdots + z_p x_p$$

is the linear combination of the p measurements representing the individual I.

The proper choice of the $z_j$'s may then be measured by the relative ease of classifying I through the use of the values of $Y_I$ and the $Y_{ki}$'s. Fisher introduces the numerical measure of the ability to distinguish between the two populations as being the ratio

$$\frac{\text{the difference between sample means}}{\text{the standard deviation within samples}}.$$

He then was able to suggest a reasonable criterion for determining appropriate values of the $z_j$'s: choose the linear function of the measurements that maximizes the ratio of the difference between sample means to the standard deviation within samples.

Mathematically, Fisher maximized the ratio

$$\frac{\bar{Y}_1 - \bar{Y}_2}{\sqrt{\left[ \sum_{i=1}^{n_1} (Y_{1i} - \bar{Y}_1)^2 + \sum_{i=1}^{n_2} (Y_{2i} - \bar{Y}_2)^2 \right] / (n_1 + n_2 - 2)}}, \tag{13}$$

where

$$\bar{Y}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} Y_{ki}$$

is the mean value of the new variable for the kth sample. Note that the constant factor of $n_1 + n_2 - 2$ may be omitted since constant factors do not affect the maximization problem. Also, the square of the ratio may be considered for ease of computation.

One can show that $\bar{Y}_1 - \bar{Y}_2 = \sum_{j=1}^{p} z_j d_j$, where $d_j = \bar{x}_{1j} - \bar{x}_{2j}$ is the difference between sample means for the jth trait. This difference, $\bar{Y}_1 - \bar{Y}_2$, may be denoted as B.

In a similar manner the sum of squares due to the variability within samples, which is

$$\sum_{i=1}^{n_1} (Y_{1i} - \bar{Y}_1)^2 + \sum_{i=1}^{n_2} (Y_{2i} - \bar{Y}_2)^2, \tag{14}$$

can be shown to be

$$\sum_{j=1}^{p} \sum_{m=1}^{p} z_j z_m w_{jm} = T, \tag{15}$$

where $w_{jm}$ is the pooled sum of products of deviations from the sample means of traits j and m; that is,

$$w_{jm} = \sum_{k=1}^{2} \sum_{i=1}^{n_k} (x_{kji} - \bar{x}_{kj})(x_{kmi} - \bar{x}_{km}). \tag{16}$$

Now the problem is to determine the values of the $z_j$'s for which the ratio

$$\frac{B^2}{T} = \left( \sum_{j=1}^{p} z_j d_j \right)^2 \Bigg/ \sum_{j=1}^{p} \sum_{m=1}^{p} z_j z_m w_{jm} \tag{17}$$

is maximized. In order to find the maximizing solution, differentiate (17) with respect to $z_r$ and set the derivative equal to zero, $r = 1, 2, \ldots, p$. This gives the following:

$$\sum_{j=1}^{p} z_j w_{jr} = d_r \sum_{j=1}^{p} \sum_{m=1}^{p} z_j z_m w_{jm} \Bigg/ \sum_{j=1}^{p} z_j d_j, \qquad r = 1, 2, \ldots, p.$$

Since $\sum_{j} \sum_{m} z_j z_m w_{jm} / \sum_{j} z_j d_j$ is the same constant in every one of the p equations, and one is interested only in a proportional solution, it can be ignored, leaving

$$\sum_{j=1}^{p} z_j w_{jr} = d_r, \qquad r = 1, 2, \ldots, p,$$

which is a set of p simultaneous linear equations:

$$\begin{aligned}
z_1 w_{11} + z_2 w_{12} + \cdots + z_p w_{1p} &= d_1 \\
z_1 w_{21} + z_2 w_{22} + \cdots + z_p w_{2p} &= d_2 \\
&\ \ \vdots \\
z_1 w_{p1} + z_2 w_{p2} + \cdots + z_p w_{pp} &= d_p
\end{aligned}$$

Representing this system of equations in matrix notation, one has

$$(W)(Z) = (D), \tag{18}$$

where (W) is a p x p matrix, while (Z) and (D) are p x 1 column vectors.

Thus one has a set of simultaneous equations which when solved will give the z multipliers which maximize the ratio $B^2 / T$. This is the ratio of the square of the difference between sample means to the sum of squares within samples for the variable Y. Once the Z vector is computed one can easily compute the quantities $\bar{Y}_1$, $\bar{Y}_2$, and $Y_I$. The problem now becomes a univariate one, and I is placed in $\pi_1$ or in $\pi_2$ depending upon which $\bar{Y}_k$ the value $Y_I$ is closest to. That is, if $Y_I$ is closer to $\bar{Y}_1$ than to $\bar{Y}_2$, classify I as coming from population $\pi_1$; otherwise, classify I as coming from population $\pi_2$. For simplification of computations, note that

$$\bar{Y}_k = z_1 \bar{x}_{k1} + z_2 \bar{x}_{k2} + \cdots + z_p \bar{x}_{kp}.$$
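Fisher's procedure can be sketched as a short computation: form W from (16), solve (18) for Z, and compare $Y_I$ with the two sample means of Y. The code below uses only the Python standard library; the function names and the two small samples are hypothetical, and the solver does not guard against a singular W.

```python
def column_means(sample):
    n = len(sample)
    return [sum(row[j] for row in sample) / n for j in range(len(sample[0]))]


def within_products(sample1, sample2):
    """Matrix W of pooled sums of products of deviations, as in (16)."""
    p = len(sample1[0])
    w = [[0.0] * p for _ in range(p)]
    for sample in (sample1, sample2):
        m = column_means(sample)
        for row in sample:
            for j in range(p):
                for k in range(p):
                    w[j][k] += (row[j] - m[j]) * (row[k] - m[k])
    return w


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fisher_discriminant(sample1, sample2):
    """Return (z, ybar1, ybar2), where z solves W z = d as in (18)."""
    m1, m2 = column_means(sample1), column_means(sample2)
    d = [a - b for a, b in zip(m1, m2)]
    z = solve(within_products(sample1, sample2), d)
    ybar1 = sum(zj * mj for zj, mj in zip(z, m1))
    ybar2 = sum(zj * mj for zj, mj in zip(z, m2))
    return z, ybar1, ybar2


def classify(item, z, ybar1, ybar2):
    """Assign the item to the sample whose mean of Y is nearer to Y_I."""
    y_item = sum(zj * xj for zj, xj in zip(z, item))
    return 1 if abs(y_item - ybar1) <= abs(y_item - ybar2) else 2


# Hypothetical samples with p = 2 traits.
sample1 = [[5.1, 3.5], [4.9, 3.0], [4.7, 3.2], [5.0, 3.6]]
sample2 = [[7.0, 3.2], [6.4, 3.2], [6.9, 3.1], [5.5, 2.3]]
z, ybar1, ybar2 = fisher_discriminant(sample1, sample2)
group = classify([5.2, 3.4], z, ybar1, ybar2)
```

Because Z is determined only up to a proportionality constant, multiplying `z` by any positive number leaves every classification unchanged.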
PROPERTIES  OF  THE  DISCRIMINANT  FUNCTION 
Comparison  of  Two  Methods 

It is interesting to note that, for the case studied earlier in which the density functions of the two populations are multivariate normal with equal covariance matrices and equal a priori probabilities, it can be shown that the discriminating functions obtained by the two different methods are, in fact, the same function. Thus Welch's work put a theoretical basis under Fisher's discriminant function, at least in this special case.

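Consider the case where $\bar{Y}_1$ was found to be the larger of the two sample means. Then using Fisher's method $R_1$ would be the set of $Y_I$ for which

$$R_1 : Y_I \ge \bar{Y}_1 - \frac{1}{2} (\bar{Y}_1 - \bar{Y}_2) \quad \text{or} \quad Y_I \ge \frac{1}{2} (\bar{Y}_1 + \bar{Y}_2), \tag{19}$$

i.e.,

$$\bar{Y}_2 < \frac{1}{2} (\bar{Y}_1 + \bar{Y}_2) \le Y_I.$$

Note that $Y_I = \sum_{j=1}^{p} x_j z_j = V'(Z)$ and

$$\bar{Y}_k = \sum_{j=1}^{p} \bar{x}_{kj} z_j = (\bar{X}^{(k)})'(Z),$$

where $(Z)' = (z_1, \ldots, z_p)$ and $(\bar{X}^{(k)})' = (\bar{x}_{k1}, \ldots, \bar{x}_{kp})$.

Hence the region $R_1$ of (19) can be written as

$$V'(Z) \ge (\bar{X}^{(1)})'(Z) - \frac{1}{2} \left[ (\bar{X}^{(1)})'(Z) - (\bar{X}^{(2)})'(Z) \right].$$

The Z vector is the solution of the matrix equation (18) and can be expressed as $(Z) = (W)^{-1}(D)$; hence, $R_1$ contains those items for which

$$V'(Z) \ge \left[ (\bar{X}^{(1)})' - \frac{1}{2} (\bar{X}^{(1)})' + \frac{1}{2} (\bar{X}^{(2)})' \right] (Z),$$

$$V'(W)^{-1}(D) \ge \left[ \frac{1}{2} (\bar{X}^{(1)} + \bar{X}^{(2)})' \right] (W)^{-1}(D), \tag{20}$$

or

$$V'(W)^{-1} (\bar{X}^{(1)} - \bar{X}^{(2)}) \ge \frac{1}{2} (\bar{X}^{(1)} + \bar{X}^{(2)})' (W)^{-1} (\bar{X}^{(1)} - \bar{X}^{(2)}). \tag{21}$$

Multiplying both sides of equation (21) by the constant $n_1 + n_2 - 2$, equation (21) becomes (11), for

$$\frac{1}{n_1 + n_2 - 2} (W) = (S).$$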
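The equivalence of (21) and (11) can be checked numerically for p = 2. The sketch below evaluates the left side minus the right side of each inequality using an explicit 2 x 2 inverse; the summary statistics, the item measured, and the function names are hypothetical values chosen for illustration.

```python
def inverse_2x2(m):
    """Inverse of a 2 x 2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def rule_side(v, xbar1, xbar2, matrix_inverse):
    """Left side minus right side of the inequality in (21) or (11)."""
    diff = [a - b for a, b in zip(xbar1, xbar2)]
    coef = mat_vec(matrix_inverse, diff)
    mid = [(a + b) / 2.0 for a, b in zip(xbar1, xbar2)]
    return sum((vi - mi) * ci for vi, mi, ci in zip(v, mid, coef))


# Hypothetical summary statistics for p = 2 traits and n1 = n2 = 4.
n1, n2 = 4, 4
xbar1, xbar2 = [4.9, 3.3], [6.4, 3.0]
w = [[1.5, 0.9], [0.9, 0.8]]  # within sums of products
s = [[wjm / (n1 + n2 - 2) for wjm in row] for row in w]  # S = W / (n1 + n2 - 2)

item = [5.5, 3.4]
fisher_side = rule_side(item, xbar1, xbar2, inverse_2x2(w))  # inequality (21)
normal_side = rule_side(item, xbar1, xbar2, inverse_2x2(s))  # inequality (11)
```

The two quantities differ only by the positive factor $n_1 + n_2 - 2$, so they always have the same sign and the two rules place every item in the same region.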




The  Significance  of  a  Discriminant  Function 

One may ask whether a particular discriminator is "significant." Questions of "significance" in discriminant functions have usually been discussed in terms of whether or not the parent populations are identical and hence whether or not a discriminant function is illusory. They are not so much a test of the functions as they are a test of the homogeneity of the populations, by use of the functions. If heterogeneity is found, the function is significant in the sense that it discriminates between real differences in an optimal way. For making this test of significance Fisher suggested the use of an analysis of variance. The total sum of squares of deviations of all observations from their grand mean can be expressed as

$$\sum_{k=1}^{2} \sum_{i=1}^{n_k} (Y_{ki} - \bar{Y}_{\cdot\cdot})^2 = \sum_{k=1}^{2} \sum_{i=1}^{n_k} (Y_{ki} - \bar{Y}_k)^2 + \sum_{k=1}^{2} n_k (\bar{Y}_k - \bar{Y}_{\cdot\cdot})^2. \tag{22}$$

The first component on the right side of equation (22) expresses the "within sample" variation and the second component expresses the "between sample" variation. Furthermore, the component representing the "between sample" variation can be written as:

$$\frac{\left( \sum_{i=1}^{n_1} Y_{1i} \right)^2}{n_1} + \frac{\left( \sum_{i=1}^{n_2} Y_{2i} \right)^2}{n_2} - \frac{\left[ \sum_{i=1}^{n_1} Y_{1i} + \sum_{i=1}^{n_2} Y_{2i} \right]^2}{n_1 + n_2}$$

or

$$\left[ n_2 \sum_{i=1}^{n_1} Y_{1i} - n_1 \sum_{i=1}^{n_2} Y_{2i} \right]^2 \Bigg/ n_1 n_2 (n_1 + n_2),$$

which for our discriminant problem is:

$$\frac{n_1 n_2}{n_1 + n_2} (\bar{Y}_1 - \bar{Y}_2)^2 = \frac{n_1 n_2}{n_1 + n_2} B^2. \tag{23}$$

Hence equation (23) represents the sum of squares due to the variability "between sample" means. The sum of squares due to variability "within samples" is given by equation (14).

If the system of equations (18), whose solution is the (Z) column vector, is multiplied on the left by the transpose of the (Z) column vector, i.e.,

$$(Z)'(W)(Z) = (Z)'(D), \tag{24}$$

the left hand side of equation (24) is T, the sum of squares due to variability "within samples," and the right hand side of the equation is B, the difference between the sample means of Y, whose square enters the "between sample" sum of squares (23). Hence T = B. Fisher then concluded that if the measurements were normally distributed, or nearly normally distributed, then the linear compound of measurements, i.e., the $Y_{ki}$, would be normally distributed. Therefore, if the variances of the two transformed groups are equal, the analysis of variance table would be:

Source of Variation     Degrees of Freedom        Sum of Squares

Between samples         $p$                       $\dfrac{n_1 n_2}{n_1 + n_2} B^2$

Within samples          $n_1 + n_2 - p - 1$       $T = B$

Total                   $n_1 + n_2 - 1$           $B \left( 1 + \dfrac{n_1 n_2}{n_1 + n_2} B \right)$

(25)


This analysis of variance gives a means for testing the hypothesis that the two samples are actually from different populations. This situation would be indicated by a significant F value with p and $n_1 + n_2 - p - 1$ degrees of freedom. A comparison of this analysis of variance and Hotelling's $T^2$ statistic, which was presented earlier in the report, will show that the two are identical testing procedures. Indeed, the F ratio formed from table (25) is $\frac{n_1 n_2 (n_1 + n_2 - p - 1)}{(n_1 + n_2)\, p} B$, and since $B = D'(W)^{-1}D = D'S^{-1}D / (n_1 + n_2 - 2)$, this is the quantity shown earlier to have the F-distribution with p and $n_1 + n_2 - p - 1$ degrees of freedom.

Computing  Probabilities  of  Misclassification 

The breakdown of the sum of squares in the analysis of variance is also of
interest in relation to the probabilities of misclassification. The
within samples variation, divided by the within samples degrees of freedom,
gives an estimate of the variance of $Y$. That is, the estimate of the
variance of a single item $Y$ is $B / (n_1 + n_2 - p - 1)$. Using the procedure
(19), an element from group two is misclassified if its deviation from
$\bar{Y}_2$, in the right direction, exceeds $\frac{1}{2}(\bar{Y}_1 - \bar{Y}_2)$.
Also, an element from group one is misclassified if its deviation from
$\bar{Y}_1$, in the right direction, exceeds $\frac{1}{2}(\bar{Y}_1 - \bar{Y}_2)$.
To find the probabilities of misclassification one simply needs to find the
probability that the deviation of $Y$ will exceed the amount which causes
misclassification. Fisher treats the ratio of
$\frac{1}{2}(\bar{Y}_1 + \bar{Y}_2) - \bar{Y}_k$ to the standard error of $Y$
as being distributed as Student's t-distribution with $(n_1 + n_2 - p - 1)$
degrees of freedom ($k$ = 1 or 2). Thus to find the probability of
misclassifying an element which belongs to group two, the ratio of
$\frac{1}{2}(\bar{Y}_1 + \bar{Y}_2) - \bar{Y}_2$ to the standard error of $Y$
is computed. Then by comparing this ratio to the tabulated values of the
appropriate t-distribution one determines the probability of
misclassification. That is, $P(1/2)$ is equal to the probability of getting a
t-variate, with appropriate degrees of freedom, which is greater than the
ratio of $\frac{1}{2}(\bar{Y}_1 + \bar{Y}_2) - \bar{Y}_2$ to the standard error
of $Y$. The probability of misclassifying an element which belongs to group
one, $P(2/1)$, is equal to the probability of getting a t-variate, with
appropriate degrees of freedom, which is less than the ratio of
$\frac{1}{2}(\bar{Y}_1 + \bar{Y}_2) - \bar{Y}_1$ to the standard error of $Y$.
It must be kept in mind that the deviation has to be in the right direction
in order to cause misclassification. The total probability of
misclassification is then $p_1 P(2/1) + p_2 P(1/2)$. In computing the
probability of misclassification Fisher assumed that $p_1$ was equal to $p_2$.
This of course may not be the true situation; and, if not, an adjustment must
be made. A previous section showed that Welch's method and Fisher's method are
actually equivalent. Using Welch's method, equations (7) and (8) gave the
procedure for classification when $p_1 \neq p_2$ but gave no means for
determining the probabilities of misclassification. Following Rao (1952), for
the case where $p_1 \neq p_2$, one can express Welch's region $R_1$ in vector
notation as

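$$R_1 : \; v' S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)}) - \tfrac{1}{2}(\bar{X}^{(1)} + \bar{X}^{(2)})' S^{-1}(\bar{X}^{(1)} - \bar{X}^{(2)}) \ge \ln p_2 - \ln p_1 ,$$

which can be written as:

$$R_1 : \; Y \ge \tfrac{1}{2}(\bar{Y}_1 + \bar{Y}_2) + \ln p_2 - \ln p_1 .$$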

Under these conditions $P(2/1)$ is determined by finding the probability
that a t-variate with $(n_1 + n_2 - 2)$ degrees of freedom will be less than the
ratio of $\frac{1}{2}(\bar{Y}_1 + \bar{Y}_2) + \ln p_2 - \ln p_1 - \bar{Y}_1$ to
the standard error of $Y$. $P(1/2)$ is equal to the probability that a
t-variate with $(n_1 + n_2 - 2)$ degrees of freedom will equal or exceed the
ratio of $\frac{1}{2}(\bar{Y}_1 + \bar{Y}_2) + \ln p_2 - \ln p_1 - \bar{Y}_2$ to
the standard error of $Y$. Again the deviation must occur in the right
direction to cause misclassification. The standard error of $Y$ is obtained
from the analysis of variance of (25).

The "Doubtful" Region

The classification problem is somewhat different from most problems of
testing hypotheses. The general procedure for testing hypotheses is to
arbitrarily set the probability of a type one error and then use some
specified procedure to determine the region of rejection for the test. In the
above approach to the classification problem the critical point was determined
first, and then the probability of misclassification using this critical point
was determined. The critical point was chosen so that the probability of
misclassification would be a minimum and was not arbitrarily set by the
investigator. Only when $P(2/1)$ and $P(1/2)$ are small can the investigator
assert with a high degree of confidence that any given individual is correctly
classified. Rao (1952) gives a method by which $P(2/1)$ and $P(1/2)$ can be
arbitrarily chosen by the investigator. To do this Rao divides the $R$ space
into three regions, $R_1$, $R_2$, and $R_D$. Individuals that fall in regions
$R_1$ and $R_2$ are classified into populations $\pi_1$ and $\pi_2$
respectively, and individuals falling into $R_D$ remain in doubt as to which
population they belong.

These three regions are:

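$$R_1 : \; \frac{f_1(v; \theta_1)}{f_2(v; \theta_2)} \ge A$$

$$R_D : \; B < \frac{f_1(v; \theta_1)}{f_2(v; \theta_2)} < A \qquad (26)$$

$$R_2 : \; \frac{f_1(v; \theta_1)}{f_2(v; \theta_2)} \le B$$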

Then within certain limitations the quantities $A$ and $B$ can be chosen so
that the probabilities of misclassification can be set at preassigned levels.
The diagram below shows the nature of the decisions that could be made after
ascertaining the value of the ratio or its logarithm.

[Diagram: the axis of $\ln [f_1(v; \theta_1) / f_2(v; \theta_2)]$, marked at the
points $B' < C' < A'$ and divided, from left to right, into the regions
$R_2$, $D_2$, $D_1$, $R_1$.]

$B' = \ln B$,  $A' = \ln A$,  $C' = \ln p_2 - \ln p_1$

An individual in region $R_1$ can be assumed (at a given risk) to belong to
population $\pi_1$. The region $R_2$ has a similar meaning for population
$\pi_2$. Individuals falling into regions $D_1$ and $D_2$ may be provisionally
assigned to $\pi_1$ or $\pi_2$, respectively. The point $B'$ is determined such
that if an individual belongs to group 1 the probability of its $Y$ value being
equal to or less than $B'$ is $P(2/1)$. Rao states that one can find this value
of $B'$ by setting the ratio of $(B' - \bar{Y}_1)$ to the standard error of $Y$
equal to that value of the appropriate t-distribution for which the
probability of getting a smaller t value is equal to $P(2/1)$, and solving for
$B'$. The point $A'$ is determined such that if an element belongs to group 2
the probability of its $Y$ value being equal to or greater than $A'$ is
$P(1/2)$. One can find the value of $A'$ by setting the ratio of
$(A' - \bar{Y}_2)$ to the standard error of $Y$ equal to that value of the
appropriate t-distribution for which the probability of getting a larger t
value is equal to $P(1/2)$, and solving for $A'$. The standard error of $Y$ can
be obtained from the analysis of variance of (25). It should be noted that
$C'$, which equals zero when $p_1 = p_2$, is the critical value obtained when
using just two regions $R_1$ and $R_2$.
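
The two cut-off points can be computed directly once the group means of $Y$, the standard error of $Y$, and the tabulated t values are in hand. The sketch below is only an illustration: the function names are hypothetical, the t values must be looked up by the user, and the numbers passed in are those of the two-race example in the appendix. It assumes the lower cut-off falls below the upper one.

```python
def doubtful_region_cutoffs(ybar1, ybar2, se_y, t_lower, t_upper):
    """Return the lower and upper cut-offs on the discriminant axis.

    t_lower: t value with Pr(t < t_lower) = P(2/1), a negative number.
    t_upper: t value with Pr(t > t_upper) = P(1/2), a positive number.
    """
    lower = ybar1 + t_lower * se_y   # solves (lower - ybar1) / se_y = t_lower
    upper = ybar2 + t_upper * se_y   # solves (upper - ybar2) / se_y = t_upper
    return lower, upper


def assign(y, lower, upper):
    """Classify a discriminant score, withholding judgment in between."""
    if y >= upper:
        return "R1"
    if y <= lower:
        return "R2"
    return "doubtful"


# Values taken from the appendix example (t = 2.093 with 19 d.f.).
lower, upper = doubtful_region_cutoffs(2.5880, 2.2359, 0.13613, -2.093, 2.093)
decision = assign(2.451145, lower, upper)
```

With these inputs the individual scored in the appendix falls between the two cut-offs, so judgment on it would be withheld.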

One might believe that the use of the doubtful region is the "ideal"
situation, for it gives a means for controlling the probabilities of
misclassification. However, in a practical application, if $P(2/1)$ and
$P(1/2)$ are set too small then $D_1$ and $D_2$ will become too large, i.e.,
the doubtful region will contain too many individuals and one would be
reserving judgment on, or not classifying, too many of the individuals.

THE  PROBLEM  OF  THREE  OR  MORE  GROUPS 

In the previous sections it was seen that if measurements on a certain
number of variates are available for two groups, it is possible to construct
a discriminant function which gives the maximum discrimination between them.
This function is useful in assigning, with a certain degree of confidence, an
individual to one or the other of the two groups to which it is known to
belong. Let us now consider the problem of assigning an individual to one of
$k$ groups from which it is known to have come. Let $\pi_1, \ldots, \pi_k$ be
$k$ populations with density functions
$f_1(v; \theta_1), \ldots, f_k(v; \theta_k)$ respectively. If the p-dimensional
space is divided into $k$ regions $R_1, \ldots, R_k$ such that an observation
falling into $R_j$ is classified as belonging to $\pi_j$, then the probability
of misclassifying an observation from the ith population as coming from the
jth population is

$$P(j/i) = \int_{R_j} f_i(v; \theta_i) \, dv .$$

If a priori probabilities $p_1, \ldots, p_k$ of an individual coming from
$\pi_1, \ldots, \pi_k$ respectively exist, and the cost of misclassifying an
observation coming from $\pi_i$ as coming from $\pi_j$ is $C(j/i)$, then the
expected loss due to misclassification is:

$$\sum_{i=1}^{k} p_i \sum_{j \ne i} P(j/i) \, C(j/i) = \sum_{i=1}^{k} p_i \sum_{j \ne i} \int_{R_j} C(j/i) \, f_i(v; \theta_i) \, dv . \qquad (27)$$

The problem, then, is that of choosing the regions $R_1, \ldots, R_k$ so that
(27) is a minimum. Since a priori probabilities exist it is possible to
define the conditional probability of an individual coming from a population,
given the observed values of the individual's $p$ variates. The conditional
probability that an observation comes from $\pi_i$ is

$$\Pr(\pi_i / v) = \frac{p_i f_i(v; \theta_i)}{\sum_{m=1}^{k} p_m f_m(v; \theta_m)} , \qquad i = 1, \ldots, k .$$

If one classifies an individual selected at random as belonging to $\pi_j$,
the expected loss is

$$\sum_{i \ne j} \frac{p_i f_i(v; \theta_i)}{\sum_{m=1}^{k} p_m f_m(v; \theta_m)} \, C(j/i) . \qquad (28)$$

In order to minimize the expected loss one chooses that $j$ for which equation
(28), or equivalently for which

$$\sum_{i \ne j} p_i f_i(v; \theta_i) \, C(j/i) , \qquad (29)$$

is a minimum.

These statements are summarized by a theorem due to T. W. Anderson
(1958). "If $p_i$ is the a priori probability of drawing an observation from
population $\pi_i$ with density $f_i(v; \theta_i)$, $(i = 1, \ldots, k)$, and if
the cost of misclassifying an observation from $\pi_i$ as from $\pi_j$ is
$C(j/i)$, then the regions of classification, $R_1, \ldots, R_k$, that minimize
the expected cost are defined by assigning $v$ to $R_m$ if

$$\sum_{i \ne m} p_i f_i(v; \theta_i) \, C(m/i) < \sum_{i \ne j} p_i f_i(v; \theta_i) \, C(j/i) \qquad (30)$$

$(j = 1, \ldots, k ; \; j \ne m)$."
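
The rule in (30) amounts to evaluating the sum in (29) for every candidate group and taking the smallest. A minimal sketch follows, assuming the density values at the observed point have already been computed; the function name, the zero-based group indices and the numbers in the example are illustrative and do not come from the report.

```python
def min_expected_cost_group(priors, densities, cost):
    """Index m of the group minimizing sum over i != m of p_i * f_i * C(m/i).

    priors[i]    : a priori probability p_i
    densities[i] : f_i(v; theta_i) evaluated at the observed v
    cost[j][i]   : C(j/i), cost of assigning to group j an item from group i
    """
    k = len(priors)

    def expected_cost(j):
        return sum(priors[i] * densities[i] * cost[j][i]
                   for i in range(k) if i != j)

    return min(range(k), key=expected_cost)


# Hypothetical three-group illustration.
priors = [0.5, 0.3, 0.2]
densities = [0.10, 0.40, 0.30]

equal_cost = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
choice_equal = min_expected_cost_group(priors, densities, equal_cost)

# Make it ten times as costly to misclassify an item from group 0.
unequal_cost = [[0, 1, 1], [10, 0, 1], [10, 1, 0]]
choice_unequal = min_expected_cost_group(priors, densities, unequal_cost)
```

With equal costs the choice is the group with the largest $p_i f_i$; raising the cost of misclassifying group 0 moves the decision to that group even though its $p_i f_i$ is the smallest.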
If the cost of misclassification is equal for all groups, that is, $C(i/j)$
is constant for all $i$ and $j$, (30) reduces to

$$R_m : \; p_j f_j(v; \theta_j) < p_m f_m(v; \theta_m) , \qquad j \ne m . \qquad (31)$$

In this case the observation $v$ is in $R_m$ if $m$ is the index for which
$p_m f_m(v; \theta_m)$ is a maximum; that is, $\pi_m$ is the most probable
population.

The proof of Anderson's theorem will be included at this point.
Note that the expected loss due to misclassification (27) can be written as

$$\sum_{j=1}^{k} \int_{R_j} \sum_{i \ne j} p_i \, C(j/i) \, f_i(v; \theta_i) \, dv . \qquad (32)$$

By letting

$$h_j(v) = \sum_{i \ne j} p_i \, C(j/i) \, f_i(v; \theta_i)$$

one has

$$\sum_{j=1}^{k} \int_{R_j} h_j(v) \, dv = \int h(v) \, dv$$

as the expected loss due to misclassification, where $h(v) = h_j(v)$ for $v$
in $R_j$.

For the procedure described in the theorem, $h(v)$ is
$h^*(v) = \min_j h_j(v)$, $j = 1, \ldots, k$. The difference between the
expected loss for any procedure $R$ and the procedure $R^*$ is

$$\int [h(v) - h^*(v)] \, dv = \sum_{j=1}^{k} \int_{R_j} [h_j(v) - \min_i h_i(v)] \, dv . \qquad (33)$$

Each integrand in (33) is nonnegative, so (33) is equal to or greater than
zero. Therefore the expected loss incurred by using any other procedure must
be equal to or greater than the expected loss incurred when using the
procedure of Anderson's theorem.
For further consideration of the k-group case, let us consider three
populations; the generalization to $k$ is immediate. Furthermore let us
assume that the costs of misclassification are the same for all three groups.
From (31) the regions for classification are:

$$R_1 : \; p_1 f_1(v; \theta_1) \ge p_2 f_2(v; \theta_2) , \quad p_1 f_1(v; \theta_1) \ge p_3 f_3(v; \theta_3)$$

$$R_2 : \; p_2 f_2(v; \theta_2) > p_1 f_1(v; \theta_1) , \quad p_2 f_2(v; \theta_2) \ge p_3 f_3(v; \theta_3) \qquad (34)$$

$$R_3 : \; p_3 f_3(v; \theta_3) > p_1 f_1(v; \theta_1) , \quad p_3 f_3(v; \theta_3) > p_2 f_2(v; \theta_2) .$$

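If the a priori probabilities are all equal and the regions are defined in
terms of their logarithms they become

$$R_1 : \; u_{12} \ge 0 , \quad u_{13} \ge 0$$

$$R_2 : \; u_{21} > 0 , \quad u_{23} \ge 0 \qquad (35)$$

$$R_3 : \; u_{31} > 0 , \quad u_{32} > 0$$

where

$$u_{ij} = \ln \frac{f_i(v; \theta_i)}{f_j(v; \theta_j)} .$$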

These regions may also be used for classifying an observation when nothing is
known about $p_1$, $p_2$, $p_3$, the a priori probabilities. For computational
purposes it is advantageous to express the regions $R_1$, $R_2$, $R_3$ in still
another form. By referring to an earlier section of this report one sees that
the statement $u_{ij} \ge 0$ is equivalent to the statement
$Y_{ij} \ge \frac{1}{2}(\bar{Y}_i + \bar{Y}_j)$, where
$Y_{ij} = x_1 z_{ij1} + \cdots + x_p z_{ijp}$. That is, $Y_{ij}$ is a linear
combination of the $p$ measurements taken from an individual. Using matrix
notation, $Y_{ij} = v'(Z)_{ij}$, where the $(Z)_{ij}$ column vector is the
solution to the system of equations
$\Sigma (Z)_{ij} = (\mu^{(i)} - \mu^{(j)})$ and $\Sigma$ is the common
covariance matrix; also $\bar{Y}_i = (\mu^{(i)})'(Z)_{ij}$ and
$\bar{Y}_j = (\mu^{(j)})'(Z)_{ij}$. Using the above as well as the fact that
$u_{ij} = -u_{ji}$, the regions can be expressed as

$$R_1 : \; Y_{12} \ge \tfrac{1}{2}(\bar{Y}_1 + \bar{Y}_2) , \quad Y_{13} \ge \tfrac{1}{2}(\bar{Y}_1 + \bar{Y}_3)$$

$$R_2 : \; Y_{12} < \tfrac{1}{2}(\bar{Y}_1 + \bar{Y}_2) , \quad Y_{23} \ge \tfrac{1}{2}(\bar{Y}_2 + \bar{Y}_3) \qquad (36)$$

$$R_3 : \; Y_{13} < \tfrac{1}{2}(\bar{Y}_1 + \bar{Y}_3) , \quad Y_{23} < \tfrac{1}{2}(\bar{Y}_2 + \bar{Y}_3)$$

It is most important to note that $\bar{Y}_i$ is the mean of the ith group using
the discriminant function $Y_{ij}$, and $\bar{Y}_j$ is the mean of the jth group
using the discriminant function $Y_{ij}$. That is, in region $R_2$ of (36),
$\bar{Y}_2$ in the first statement is not equal to $\bar{Y}_2$ in the second
statement. For $\bar{Y}_2$ in the first statement is equal to
$(\mu^{(2)})'(Z)_{12}$, and $\bar{Y}_2$ in the second statement is equal to
$(\mu^{(2)})'(Z)_{23}$. Thus when one speaks of $\bar{Y}_i$ and $\bar{Y}_j$ it
must be with reference to a particular discriminant function.

A detailed investigation into the above procedure reveals that nothing
more has been done than to compute a simple (two-group) discriminant function
for each possible pair of groups. That is, given an element at random, to
determine whether it lies in $R_1$ we use the simple discriminant function
$Y_{12}$ to distinguish between $\pi_1$ and $\pi_2$ and the simple discriminant
function $Y_{13}$ to distinguish between $\pi_1$ and $\pi_3$. Similarly, by
considering the other possible simple discriminant functions one can determine
the regions $R_2$ and $R_3$.

In the k-group classification problem, as in the 2-group problem, the
regions $R_1, \ldots, R_k$ were chosen so that the probability of
misclassification would be a minimum. By using this method the investigator
has no control over the error rate. So before one asserts that any given
individual is correctly classified he would like to know what the error rate
is. When $k = 3$ the total probability of misclassification is the sum of
$p_1 [P(2/1) + P(3/1)]$, $p_2 [P(1/2) + P(3/2)]$ and $p_3 [P(1/3) + P(2/3)]$.
To determine the probabilities of misclassification one needs to find the
variances and covariances of the three discriminant functions. It can be
shown that these values are readily obtainable from the mean values of the
functions, that is,

$$\begin{aligned}
\operatorname{var}(Y_{12}) &= \bar{Y}_1 - \bar{Y}_2 \\
\operatorname{var}(Y_{23}) &= \bar{Y}_2 - \bar{Y}_3 \\
\operatorname{var}(Y_{13}) &= \bar{Y}_1 - \bar{Y}_3 \\
\operatorname{cov}(Y_{12}, Y_{23}) &= \bar{Y}_2 - \bar{Y}_3 \\
\operatorname{cov}(Y_{12}, Y_{13}) &= \operatorname{var}(Y_{12}) + \operatorname{cov}(Y_{12}, Y_{23}) \\
\operatorname{cov}(Y_{23}, Y_{13}) &= \operatorname{var}(Y_{23}) + \operatorname{cov}(Y_{12}, Y_{23})
\end{aligned} \qquad (37)$$

where in $\operatorname{var}(Y_{12})$, $\bar{Y}_1$ and $\bar{Y}_2$ are the means
for groups one and two, respectively, using discriminant function $Y_{12}$.
Similarly in $\operatorname{var}(Y_{23})$, $\bar{Y}_2$ and $\bar{Y}_3$ are the
means for groups two and three, respectively, using discriminant function
$Y_{23}$. In $\operatorname{var}(Y_{13})$, $\bar{Y}_1$ and $\bar{Y}_3$ are the
means for groups one and three, respectively, using discriminant function
$Y_{13}$. In computing $\operatorname{cov}(Y_{12}, Y_{23})$, $\bar{Y}_2$ and
$\bar{Y}_3$ are the means for groups two and three, respectively, using
discriminant function $Y_{12}$. Using these variances and covariances one can
obtain the correlations between the discriminant functions. Then using the
variances and correlations one can determine the probability of misclassifying
an observation given the population to which it belongs. The probability of
correct classification for group one is:

$$\Pr\left[ Y_{12} \ge \tfrac{1}{2}(\bar{Y}_1 + \bar{Y}_2) , \; Y_{13} \ge \tfrac{1}{2}(\bar{Y}_1 + \bar{Y}_3) \right] .$$

If the $p$ variates are normally distributed, and hence the $Y_{ij}$ are
normally distributed, one can use the bivariate normal distribution to
calculate the above probability. This technique is illustrated by use of a
numerical example in the appendix.

Once the probability of misclassification is obtained it may be so large
that one could not have much confidence in a classification statement. This
problem is theoretically resolved by the existence of a "doubtful region,"
that is, a region such that if an observation lies in it judgment is
withheld and no classification is made. Rao (1952), using an extension of the
Neyman-Pearson Fundamental Lemma, has proved that there exist regions $R_1$,
$R_2$, $R_3$ and a set of doubtful regions such that the probability of
misclassification can be set at a predetermined level. The approach to this
problem is similar to that used in the two-group problem, where the
probability of misclassification is set and then regions $R_1$, $R_2$, and
$R_D$ are determined. However, when there are more than two groups, the
complexity of finding these regions for a particular problem makes their use
prohibitive.

It  has  been  assumed  that  the  distributions  of  the  three  populations  were 
completely  known.  When  the  parameters  are  unknown  and  must  be  estimated  from 
samples  one  can  substitute  the  maximum  likelihood  estimates  for  the  unknown 
parameters.  To  determine  the  regions  of  classification  one  then  treats  these 
maximum likelihood estimates as if they were the parameters of the
distribution.

UNEQUAL  COVARIANCE  MATRICES 

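Now let us consider the case in which the two multivariate normal
populations with unequal mean vectors also have unequal covariance matrices.
Let $\Sigma_i$ represent the covariance matrix for the ith population,
$i = 1, 2$. Then the region $R_1$ of (2) is written as:

$$R_1 : \; \frac{(2\pi)^{-p/2} \, |\Sigma_1|^{-1/2} \exp\left[-\tfrac{1}{2} Q_1(x_1, \ldots, x_p)\right]}{(2\pi)^{-p/2} \, |\Sigma_2|^{-1/2} \exp\left[-\tfrac{1}{2} Q_2(x_1, \ldots, x_p)\right]} \ge \frac{p_2}{p_1} \qquad (38)$$

$$R_2 : \; \text{otherwise}$$

where

$$Q_i(x_1, \ldots, x_p) = \sum_{j=1}^{p} \sum_{k=1}^{p} \sigma_i^{jk} (x_j - \mu_{ij})(x_k - \mu_{ik})$$

and $\sigma_i^{jk}$ denotes the $(j, k)$ element of $\Sigma_i^{-1}$.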

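Writing the regions $R_1$ and $R_2$ in terms of their logarithms one has

$$R_1 : \; Q_1(x_1, \ldots, x_p) \le Q_2(x_1, \ldots, x_p) + \ln \frac{|\Sigma_2|}{|\Sigma_1|} + 2 \ln \frac{p_1}{p_2}$$

$$R_2 : \; Q_1(x_1, \ldots, x_p) > Q_2(x_1, \ldots, x_p) + \ln \frac{|\Sigma_2|}{|\Sigma_1|} + 2 \ln \frac{p_1}{p_2} \qquad (39)$$

For the k-group problem following this procedure $R_1, \ldots, R_k$ of (30)
become

$$R_j : \; Q_j(x_1, \ldots, x_p) \le Q_i(x_1, \ldots, x_p) + \ln \frac{|\Sigma_i|}{|\Sigma_j|} + 2 \ln \frac{p_j}{p_i} \qquad (40)$$

$(i, j = 1, 2, \ldots, k ; \; i \ne j)$.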
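
For two variates the rule in (39) can be evaluated with nothing more than the closed-form inverse and determinant of a 2 x 2 matrix. The sketch below assumes known (or estimated) mean vectors and covariance matrices; the function names and the numerical values are hypothetical and serve only to show that the boundary is no longer linear when the covariance matrices differ.

```python
import math


def det2(cov):
    """Determinant of a 2 x 2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = cov
    return a * d - b * c


def quad_form(x, mean, cov):
    """Q(x) = (x - mean)' cov^{-1} (x - mean) for a 2 x 2 covariance matrix."""
    (a, b), (c, d) = cov
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det2(cov)


def in_region_1(x, mean1, cov1, mean2, cov2, p1=0.5, p2=0.5):
    """True when x satisfies the inequality defining R1."""
    q1 = quad_form(x, mean1, cov1)
    q2 = quad_form(x, mean2, cov2)
    bound = q2 + math.log(det2(cov2) / det2(cov1)) + 2.0 * math.log(p1 / p2)
    return q1 <= bound


# Hypothetical populations: a tight one at the origin, a diffuse one at (3, 0).
mean1, cov1 = (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))
mean2, cov2 = (3.0, 0.0), ((4.0, 0.0), (0.0, 4.0))

near_first = in_region_1((0.0, 0.0), mean1, cov1, mean2, cov2)
near_second = in_region_1((3.0, 0.0), mean1, cov1, mean2, cov2)
far_left = in_region_1((-10.0, 0.0), mean1, cov1, mean2, cov2)
```

The last point lies on the far side of the first population from the second, yet it is assigned to the second, more diffuse, population; a single linear discriminant function could not produce this pattern.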

Cooley and Lohnes (1962) were concerned with the general problem of
discrimination, with emphasis on the problem of comparing the profile of an
individual with that of a group. Their interest was that of being able to
tell a prospective student for a given curriculum how favorably he compared
with successful students in that field. For present purposes consider taking
only two measurements from each individual, so that the group to which the
individual is being compared can be considered to be a bivariate normal
population. One way to describe such a bivariate distribution is in terms of
ellipses, each of which is the locus of points of a specified density. For
the bivariate normal distribution, the size of the ellipse is determined by
the value of the quadratic:

$$Q(x_1, x_2) = \sum_{i=1}^{2} \sum_{j=1}^{2} \sigma^{ij} (x_i - \mu_i)(x_j - \mu_j) . \qquad (41)$$

Each individual is represented as a point in the sample space, and each point
can be located on a particular ellipse by substituting the individual's
observed values into equation (41).

If the individual is selected at random then $Q(x_1, x_2)$ is distributed
as $\chi^2$ with two degrees of freedom. Since the tabled probability of a
given $\chi^2$ is the likelihood of obtaining a larger value, it is also the
proportion of sample points that would be expected to lie beyond the ellipse
on which $(x_1, x_2)$ lies. The ellipse used in this manner is called a
centour, and it is a good index of the extent to which an individual resembles
a particular group. The generalization of the centour method to the
measurement of $p$ variables on each individual is obvious. (41) becomes

$$Q(x_1, \ldots, x_p) = \sum_{i=1}^{p} \sum_{j=1}^{p} \sigma^{ij} (x_i - \mu_i)(x_j - \mu_j) \qquad (42)$$

and is distributed as $\chi^2$ with $p$ degrees of freedom.
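
In the bivariate case no table is needed, because the upper tail of $\chi^2$ with two degrees of freedom has the closed form $e^{-Q/2}$. The short sketch below uses that fact; the function name is hypothetical, and the closed form applies only when $p = 2$.

```python
import math


def centour_proportion(q):
    """Proportion of a bivariate normal group expected beyond the ellipse Q = q.

    This is Pr(chi-square with 2 d.f. > q), which equals exp(-q / 2).
    """
    return math.exp(-q / 2.0)


at_centroid = centour_proportion(0.0)                  # every point lies beyond
median_ellipse = centour_proportion(2.0 * math.log(2.0))
five_percent = centour_proportion(5.991)               # tabled 5 per cent point
```

A point at the group centroid has the highest possible value, and the value falls toward zero as the point moves onto larger ellipses.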

Cooley and Lohnes suggest using the centour method for classifying
individuals into one of $k$ groups. The classification rule is to assign an
individual to that group for which its centour is highest or, in other words,
its $\chi^2$ is smallest. Thus for the two-group case:

$$R_1 : \; Q_1(x_1, \ldots, x_p) \le Q_2(x_1, \ldots, x_p)$$

and

$$R_2 : \; Q_1(x_1, \ldots, x_p) > Q_2(x_1, \ldots, x_p) . \qquad (43)$$

If there are $k$ groups the regions are:

$$R_j : \; Q_j(x_1, \ldots, x_p) \le Q_i(x_1, \ldots, x_p)$$

$(i, j = 1, \ldots, k ; \; i \ne j)$.

These regions are seen to be special cases of regions (39) and (40).

When the covariance matrices are equal, an individual who lies in $R_i$ in
the sample space of this section will also lie in $R_i$ in the discriminant
space. Therefore under these conditions it is advantageous to use the
discriminant space or discriminant function. However, when the covariance
matrices are unequal the regions of the discriminant functions have not been
given in a convenient form. Therefore when the covariance matrices for the
$k$ groups are not equal one can use the sample space of this section, i.e.,
regions (39) and (40), for the classification of an individual.

APPENDIX

To illustrate the use of a discriminant function for classifying an
individual into one of two groups, a numerical example has been taken from
the book, Statistical Analysis in Biology by K. Mather. Mather was faced
with the problem of classifying flies into one of two races. To do this he
took a sample of eleven flies from each race and measured two traits on each
fly. The observed values for both traits of the flies as well as the value
of the discriminant function for each fly are given in Table 1.


Table 1.

            Race $\pi_1$                          Race $\pi_2$
Trait No. 1  Trait No. 2  $Y_{1i}$     Trait No. 1  Trait No. 2  $Y_{2i}$

   6.36         5.24        2.546         6.00         4.88        2.394
   5.92         5.12       *2.402         5.60         4.64        2.245
   5.92         5.36        2.434         5.64         4.96        2.299
   6.44         5.64        2.623         5.76         4.80        2.313
   6.40         5.16        2.547         5.96         5.08        2.409
   6.56         5.56        2.647         5.72         5.04        2.333
   6.64         5.36        2.644         5.64         4.96        2.299
   6.68         4.96        2.602         5.44         4.88        2.231
   6.72         5.48        2.682         5.04         4.44        2.056
   6.76         5.60        2.710         4.56         4.04        1.863
   6.72         5.08        2.629         5.48         4.20        2.152

The discriminant function will be of the form $Y = z_1 x_1 + z_2 x_2$, where
$z_1$ and $z_2$ are the solutions to the following equations:

$$2.628364 \, z_1 + 1.277382 \, z_2 = .934545$$

$$1.277382 \, z_1 + 1.748655 \, z_2 = .603636 .$$

Solving for $z_1$ and $z_2$ one has

$$Y = .291174 \, x_1 + .132507 \, x_2 .$$
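
The pair of equations above can be solved by Cramer's rule. The sketch below is one way to check the coefficients; the function name is hypothetical, and the computed values should agree with the quoted coefficients to about four decimal places, the remaining difference being due to rounding in the printed equations.

```python
def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve a11*z1 + a12*z2 = b1 and a21*z1 + a22*z2 = b2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    z1 = (b1 * a22 - a12 * b2) / det
    z2 = (a11 * b2 - a21 * b1) / det
    return z1, z2


# Coefficients and right-hand sides as printed in the two equations.
z1, z2 = solve_2x2(2.628364, 1.277382,
                   1.277382, 1.748655,
                   0.934545, 0.603636)
```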
Now

$$\bar{Y}_1 = \bar{x}_{11} z_1 + \bar{x}_{12} z_2 = 2.5880 \quad \text{and} \quad \bar{Y}_2 = \bar{x}_{21} z_1 + \bar{x}_{22} z_2 = 2.2359 ,$$

for $\bar{x}_{11} = 6.465454$, $\bar{x}_{12} = 5.323636$, $\bar{x}_{21} = 5.530909$
and $\bar{x}_{22} = 4.720000$. The quantity $\frac{1}{2}(\bar{Y}_1 + \bar{Y}_2)$
is equal to 2.41195; therefore, the regions of classification are:

$$R_1 : \; Y \ge 2.41195$$

$$R_2 : \; Y < 2.41195 \qquad (45)$$


Consider an individual whose traits are observed to be $x_1 = 6.12$ and
$x_2 = 5.05$. For this individual $Y = 2.451145$, hence the individual is
classified as belonging to race $\pi_1$.

Next $d_1 = \bar{x}_{11} - \bar{x}_{21} = .934545$,
$d_2 = \bar{x}_{12} - \bar{x}_{22} = .603636$, and
$B = z_1 d_1 + z_2 d_2 = .352101$.

Therefore the analysis of variance of $Y$ may now be written as follows:

Source of Variation    d.f.    S.S.        M.S.       Variance Ratio

Between Races            2     .681863     .340932    18.397
Within Races            19     .352101     .018532
Total                   21    1.033964

By consulting the tables of the F-distribution, it is seen that the
probability of a variance ratio with 2 and 19 degrees of freedom exceeding
8.18 is less than .01. Hence, the discriminant function is highly
significant. This indicates that if one were to apply the discriminant
function to each member of the two groups, and then perform an analysis of
variance to test the hypothesis that the two transformed groups have equal
means, he would find a significant difference between the group means.
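
The entries of the table follow from (25) once the within-samples sum of squares .352101 is known. A minimal sketch of that arithmetic is given below; the function name and argument names are hypothetical.

```python
def discriminant_anova(n1, n2, p, within_ss):
    """Sums of squares, degrees of freedom and variance ratio as laid out in (25)."""
    between_ss = n1 * n2 / (n1 + n2) * within_ss ** 2
    df_between = p
    df_within = n1 + n2 - p - 1
    ratio = (between_ss / df_between) / (within_ss / df_within)
    return between_ss, df_between, df_within, ratio


# Eleven flies per race, two traits, within-samples sum of squares .352101.
between_ss, df_between, df_within, ratio = discriminant_anova(11, 11, 2, 0.352101)
total_ss = between_ss + 0.352101
```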

Assuming that the two races are equally likely to occur, misclassification
of an individual will occur when its departure from the racial mean is greater
than one half the difference between racial means, namely .17605, provided
that departure occurs in the right direction. A deviation of .17605 is
1.293 times the standard deviation of $Y$, as estimated with 19 degrees of
freedom. Now $|t|$ exceeds 1.293 by chance about 20 per cent of the time.
Since misclassification occurs in one direction only, the probability of
misclassification using this discriminant function is .10.
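
The tail area read from the t table can also be approximated numerically. The sketch below integrates the Student density with Simpson's rule; this is a stand-in for the table lookup rather than the method used in the report, and the function names are hypothetical.

```python
import math


def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, steps=2000):
    """Pr(T > t) for t >= 0, using Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


se_y = math.sqrt(0.352101 / 19)      # standard deviation of Y from the table
ratio = 0.17605 / se_y               # deviation in standard-deviation units
p_one_sided = t_upper_tail(ratio, 19)
```

The one-sided area comes out close to the rounded figure of .10 quoted above.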

If the investigator wanted the probability of misclassification to be
equal to .05, then the regions $R_1$, $R_2$ and $R_D$ of (26) are determined.
To find the region $R_2$ such that the probability of an individual from race
one falling into $R_2$ is equal to 5 per cent one solves the equation
$(B' - \bar{Y}_1) / .13613 = -2.093$ for $B'$. Similarly for $R_1$ one solves
$(A' - \bar{Y}_2) / .13613 = +2.093$ for $A'$. Since $\bar{Y}_1 = 2.5880$ and
$\bar{Y}_2 = 2.2359$, $B' = 2.3031$ and $A' = 2.5208$. Thus if the regions

$$R_1 : \; Y \ge 2.5208$$

$$R_D : \; 2.3031 < Y < 2.5208$$

$$R_2 : \; Y \le 2.3031$$

are used the probability of misclassifying an individual selected at random is
.05. Classifying all the individuals in the two samples by use of the
discriminant function (45), it is seen that only one error would be made. The
second individual in the sample from race $\pi_1$, which is indicated with an
asterisk in Table 1, would be misclassified as coming from race $\pi_2$.
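
The count of errors can be checked by scoring every fly in Table 1 against the cut-off in (45). In the sketch below the variable names are hypothetical, and the second trait of the first race-one fly is entered as 5.24, the value consistent with the reported mean of 5.323636.

```python
Z1, Z2 = 0.291174, 0.132507   # coefficients of the discriminant function
CUTOFF = 2.41195              # midpoint of the two group means of Y

race1 = [(6.36, 5.24), (5.92, 5.12), (5.92, 5.36), (6.44, 5.64),
         (6.40, 5.16), (6.56, 5.56), (6.64, 5.36), (6.68, 4.96),
         (6.72, 5.48), (6.76, 5.60), (6.72, 5.08)]
race2 = [(6.00, 4.88), (5.60, 4.64), (5.64, 4.96), (5.76, 4.80),
         (5.96, 5.08), (5.72, 5.04), (5.64, 4.96), (5.44, 4.88),
         (5.04, 4.44), (4.56, 4.04), (5.48, 4.20)]


def score(x1, x2):
    """Value of the discriminant function for one fly."""
    return Z1 * x1 + Z2 * x2


# Record (race, position in sample) for every fly on the wrong side.
errors = [(1, i + 1) for i, (x1, x2) in enumerate(race1)
          if score(x1, x2) < CUTOFF]
errors += [(2, i + 1) for i, (x1, x2) in enumerate(race2)
           if score(x1, x2) >= CUTOFF]
```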

To illustrate the 3-group classification problem let us examine an
example from Rao (1952). Table 2 gives the mean values of each
characteristic for the three groups.

Table 2. Mean Values of Characteristic Measured

Group      $x_1$      $x_2$      $x_3$      $x_4$

  A       164.51     86.43      25.49      51.24
  B       160.53     81.47      23.84      48.62
  C       158.17     81.16      21.44      46.72

The sample estimate of the covariance matrix is:

          $x_1$      $x_2$      $x_3$      $x_4$
$x_1$     32.45       7.43       1.78       3.97
$x_2$      7.43      10.24       1.17       2.43
$x_3$      1.78       1.17       3.06       1.78      $= S$
$x_4$      3.97       2.43       1.78      12.25

and the inverse of this covariance matrix is:

          $x_1$      $x_2$      $x_3$      $x_4$
$x_1$     .0371     -.0245     -.0088     -.0059
$x_2$    -.0245      .1212     -.0248     -.0125
$x_3$    -.0088     -.0248      .3680     -.0457      $= S^{-1}$
$x_4$    -.0059     -.0125     -.0457      .0927


$Y_{ij} = v'(Z)_{ij}$ where $(Z)_{ij} = (S^{-1})(\bar{X}^{(i)} - \bar{X}^{(j)})$,
so that one wants to determine next the vectors of differences between group
means. They are as follows:

$$(\bar{X}^{(1)} - \bar{X}^{(2)})' = [\, 3.98 \quad 4.96 \quad 1.65 \quad 2.62 \,]$$

$$(\bar{X}^{(1)} - \bar{X}^{(3)})' = [\, 6.34 \quad 5.27 \quad 4.05 \quad 4.52 \,]$$

$$(\bar{X}^{(2)} - \bar{X}^{(3)})' = [\, 2.36 \quad 0.31 \quad 2.40 \quad 1.90 \,]$$

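Now one can compute the $(Z)_{ij}$ column vectors, which are:

$$(Z)_{12}' = [\, -.0039 \quad .4301 \quad .3293 \quad .0819 \,]$$

$$(Z)_{13}' = [\, .0437 \quad .3265 \quad 1.0972 \quad .1305 \,]$$

$$(Z)_{23}' = [\, .0476 \quad -.1036 \quad .7679 \quad .0486 \,]$$

The discriminant functions are:

$$Y_{12} = v'(Z)_{12} = -.0039 \, x_1 + .4301 \, x_2 + .3293 \, x_3 + .0819 \, x_4$$

$$Y_{13} = v'(Z)_{13} = .0437 \, x_1 + .3265 \, x_2 + 1.0972 \, x_3 + .1305 \, x_4$$

$$Y_{23} = v'(Z)_{23} = .0476 \, x_1 - .1036 \, x_2 + .7679 \, x_3 + .0486 \, x_4 .$$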
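
The coefficient vectors are the product of the inverse covariance matrix with the mean differences, which can be verified with a few lines of arithmetic. In the sketch below the names are hypothetical; because the printed inverse is rounded to four decimals, the products agree with the quoted coefficients only to about three decimals.

```python
S_INV = [
    [0.0371, -0.0245, -0.0088, -0.0059],
    [-0.0245, 0.1212, -0.0248, -0.0125],
    [-0.0088, -0.0248, 0.3680, -0.0457],
    [-0.0059, -0.0125, -0.0457, 0.0927],
]

MEANS = {
    "A": [164.51, 86.43, 25.49, 51.24],
    "B": [160.53, 81.47, 23.84, 48.62],
    "C": [158.17, 81.16, 21.44, 46.72],
}


def coefficients(group_i, group_j):
    """Return S^{-1} (mean_i - mean_j) as a list of four coefficients."""
    diff = [a - b for a, b in zip(MEANS[group_i], MEANS[group_j])]
    return [sum(row[c] * diff[c] for c in range(4)) for row in S_INV]


z12 = coefficients("A", "B")
z13 = coefficients("A", "C")
z23 = coefficients("B", "C")
```

Since the mean differences satisfy (A - C) = (A - B) + (B - C), the third vector is also the difference of the first two, which gives a quick consistency check on the signs.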

To find the mean values of the discriminant function for the three groups
one evaluates:

$$\bar{Y}_i = (\bar{X}^{(i)})'(Z)_{ij} \quad \text{and} \quad \bar{Y}_j = (\bar{X}^{(j)})'(Z)_{ij} ,$$

for all $i$ and $j$.

Mean Values of the Discriminant Functions

                        Discriminant Function
Group      $Y_{12}$                  $Y_{13}$                  $Y_{23}$

  A        $\bar{Y}_1$ = 49.1224     $\bar{Y}_1$ = 70.0630     $\bar{Y}_1$ = 20.9406
  B        $\bar{Y}_2$ = 46.2467     $\bar{Y}_2$ = 66.1173     $\bar{Y}_2$ = 19.8706
  C        $\bar{Y}_3$ = 45.1766     $\bar{Y}_3$ = 63.0317     $\bar{Y}_3$ = 17.8551

Thus the regions of classification of (36) are:

$$R_1 : \; Y_{12} \ge 47.6845 , \quad Y_{13} \ge 66.5473$$

$$R_2 : \; Y_{12} < 47.6845 , \quad Y_{23} \ge 18.8628$$

$$R_3 : \; Y_{13} < 66.5473 , \quad Y_{23} < 18.8628$$

Consider an individual whose traits were observed to be

$$x_1 = 162.00 \qquad x_2 = 84.00 \qquad x_3 = 24.00 \qquad x_4 = 49.00$$

For this individual $Y_{12} = 47.4129$, $Y_{13} = 67.2327$, and
$Y_{23} = 19.8198$.

Since $47.4129 < 47.6845$ and $19.8198 > 18.8628$ the individual is
assigned to group B.
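
The same assignment can be reproduced mechanically from the three discriminant functions and their cut-offs. The following is a sketch with hypothetical names; the fallback value of None is included only to guard against rounding, since with a common covariance matrix the three regions cover the whole space.

```python
Z = {
    (1, 2): [-0.0039, 0.4301, 0.3293, 0.0819],
    (1, 3): [0.0437, 0.3265, 1.0972, 0.1305],
    (2, 3): [0.0476, -0.1036, 0.7679, 0.0486],
}
CUT = {(1, 2): 47.6845, (1, 3): 66.5473, (2, 3): 18.8628}


def scores(x):
    """Evaluate Y12, Y13 and Y23 for a vector of four measurements."""
    return {pair: sum(z * xi for z, xi in zip(coef, x))
            for pair, coef in Z.items()}


def classify(x):
    """Assign x to group A, B or C using the three pairwise cut-offs."""
    y = scores(x)
    above = {pair: y[pair] >= CUT[pair] for pair in Z}
    if above[(1, 2)] and above[(1, 3)]:
        return "A"
    if not above[(1, 2)] and above[(2, 3)]:
        return "B"
    if not above[(1, 3)] and not above[(2, 3)]:
        return "C"
    return None   # reachable only through rounding of the cut-offs


individual = [162.00, 84.00, 24.00, 49.00]
y_individual = scores(individual)
group = classify(individual)
```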

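To determine the probabilities of misclassification one needs the
variances and covariances of $Y_{12}$, $Y_{13}$, and $Y_{23}$. Referring to
(37) it follows that:

$$\operatorname{var} Y_{12} = 2.8757 \qquad \operatorname{cov}(Y_{12}, Y_{23}) = 1.0701$$

$$\operatorname{var} Y_{13} = 7.0313 \qquad \operatorname{cov}(Y_{12}, Y_{13}) = 3.9458$$

$$\operatorname{var} Y_{23} = 2.0155 \qquad \operatorname{cov}(Y_{23}, Y_{13}) = 3.0856$$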
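
Each of these figures is a difference of entries in the table of mean values, as (37) states. The sketch below repeats the subtraction; the variable names are hypothetical.

```python
# Group means (A, B, C) of each discriminant function, from the table of means.
MEAN_Y12 = (49.1224, 46.2467, 45.1766)
MEAN_Y13 = (70.0630, 66.1173, 63.0317)
MEAN_Y23 = (20.9406, 19.8706, 17.8551)

var_12 = MEAN_Y12[0] - MEAN_Y12[1]      # groups one and two under Y12
var_13 = MEAN_Y13[0] - MEAN_Y13[2]      # groups one and three under Y13
var_23 = MEAN_Y23[1] - MEAN_Y23[2]      # groups two and three under Y23

cov_12_23 = MEAN_Y12[1] - MEAN_Y12[2]   # groups two and three under Y12
cov_12_13 = var_12 + cov_12_23
cov_23_13 = var_23 + cov_12_23
```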

Thus the correlation matrix of Y12 , Y13 , and Y23 is:

            Y12        Y13        Y23

Y12       1.0000      .8810      .4459

Y13                  1.0000      .8200

Y23                             1.0000
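
As a check, the correlations can be recomputed from the variances and covariances quoted above. In this sketch the pairing of each covariance with its two functions is an assumption, chosen so that the values are consistent with Y23 = Y13 - Y12; the names VAR, COV, and correlation are illustrative. The recomputed values agree with the printed matrix to about two decimal places; the small differences presumably reflect rounding in the original computation.

```python
import math

# Variances and covariances as read from the text (pairing is an assumption).
VAR = {"Y12": 2.8757, "Y13": 7.0313, "Y23": 2.0155}
COV = {
    ("Y12", "Y13"): 3.9458,
    ("Y12", "Y23"): 1.0701,
    ("Y13", "Y23"): 3.0856,
}


def correlation(a, b):
    """Correlation of two discriminant functions from their (co)variances."""
    return COV[(a, b)] / math.sqrt(VAR[a] * VAR[b])


r_12_13 = correlation("Y12", "Y13")
r_12_23 = correlation("Y12", "Y23")
r_13_23 = correlation("Y13", "Y23")
print(r_12_13, r_12_23, r_13_23)
```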

The probability of correctly classifying an individual from group A is:

P(Y12 > 47.6845 ; Y13 > 66.5473) ,

which gives the standard bivariate normal deviates:

h = (47.6845 - 49.1224)/√2.8757 = -.85   and   k = (66.5473 - 70.0630)/√7.0313 = -1.33 .
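
The two deviates for group A can be recomputed as (cutoff - group mean) / standard deviation. This is a sketch under the assumption that the standard deviations are the square roots of var Y12 = 2.8757 and var Y13 = 7.0313 given earlier; the function name deviate is illustrative.

```python
import math


def deviate(cutoff, mean, variance):
    """Standardized distance of a cutoff from a group mean."""
    return (cutoff - mean) / math.sqrt(variance)


# Group A: cutoffs of Y12 and Y13 against the group A means.
h_A = deviate(47.6845, 49.1224, 2.8757)
k_A = deviate(66.5473, 70.0630, 7.0313)
print(round(h_A, 2), round(k_A, 2))
```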

Therefore, the probability of misclassifying an individual from group A is:

Pr(h > .85) + Pr(k > 1.33) - Pr(h > .85, k > 1.33; r = .88) = .198 + .092 - .085 = .205 .

The first two probabilities are obtained from the univariate normal tables,
while the third is taken from Pearson's tables for the bivariate normal
distribution.

Similarly for group B the deviates are h = .85 and k = -.71 , with
r = .45 . The two misclassification events lie in opposite tails, so the
sign of r is reversed in the joint term, and the probability of
misclassifying an individual from group B is:

Pr(h > .85) + Pr(k > .71) - Pr(h > .85, k > .71; r = -.45) = .198 + .239 - .013 = .424 .

For group C the deviates are h = .71 and k = 1.33 ; r = .82 ; so
that the probability of misclassifying an individual from group C is:

Pr(h > .71) + Pr(k > 1.33) - Pr(h > .71, k > 1.33; r = .82) = .239 + .092 - .085 = .246 .

Assuming that p1 = p2 = p3 , the probability of misclassifying an individual
taken at random from one of the three groups using these discriminant
functions is:   1/3(.205) + 1/3(.424) + 1/3(.246) = .29 .
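
The arithmetic above can be retraced with a short sketch. The univariate tail probabilities are computed from the normal distribution, while the joint probabilities are simply copied from the text (they come from Pearson's tables and are not recomputed here). The names upper_tail, GROUPS, and misclassification are illustrative.

```python
import math


def upper_tail(z):
    """Pr(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


# (h, k, joint probability) per group; the joint values are taken from the text.
GROUPS = {
    "A": (0.85, 1.33, 0.085),
    "B": (0.85, 0.71, 0.013),
    "C": (0.71, 1.33, 0.085),
}


def misclassification(h, k, joint):
    """Pr(h-event or k-event) by inclusion-exclusion."""
    return upper_tail(h) + upper_tail(k) - joint


per_group = {g: misclassification(*v) for g, v in GROUPS.items()}

# Equal a priori probabilities: simple average over the three groups.
overall = sum(per_group.values()) / len(per_group)
print(per_group, round(overall, 2))
```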

This report discusses the problem of classifying a single element into
one of k populations from which it is known to have come. The basis for
classification is whatever evidence is available about the element I and
the k populations. The first case considered is that of classifying an
element into one of two populations; the k population case follows.

The technique used is to divide the sample space into two regions, R1
and R2 , such that if an observation belongs to R1 it is classified as
coming from population one, and if the observation belongs to R2 it is
classified as coming from population two.

If the probability distributions of the two populations are completely
known and a priori probabilities of belonging to population one and two,
respectively, exist, then the method of Bayes may be used to determine the
regions of classification. When no a priori probabilities exist the minimax
solution is obtained.
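
As an illustration of the Bayes case, the sketch below assigns an observation to the population with the larger product of a priori probability and density, which is the rule that minimizes the probability of misclassification when costs are equal. The two univariate normal populations (means 0 and 2, unit variance) and the a priori probabilities are hypothetical values chosen only for the example.

```python
import math


def normal_density(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bayes_classify(x, p1, p2):
    """Return 1 or 2: the population with the larger prior * density at x.

    The populations N(0, 1) and N(2, 1) are hypothetical.
    """
    score1 = p1 * normal_density(x, 0.0, 1.0)
    score2 = p2 * normal_density(x, 2.0, 1.0)
    return 1 if score1 >= score2 else 2


# With equal priors the boundary between the two regions is halfway, at x = 1.
print(bayes_classify(0.9, 0.5, 0.5), bayes_classify(1.1, 0.5, 0.5))

# A larger prior for population one moves the boundary toward population two.
print(bayes_classify(2.0, 0.9, 0.1), bayes_classify(2.2, 0.9, 0.1))
```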

If the form of the probability distributions is known, but only maximum
likelihood estimates of the parameters are available, one treats these
estimates as if they were the true parameter values. These estimates are
obtained from two samples, one from each population.

Fisher  approached  the  classification  problem  by  considering  the  linear 
function  of  the  p  measurements  taken  from  the  item  that  would  maximize  the 
ratio  of  the  difference  between  sample  means  to  the  standard  error  within 
samples.  Under  certain  conditions  the  two  procedures  result  in  the  same 
regions  of  classification. 
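
A minimal sketch of Fisher's criterion for two groups and p = 2 measurements follows. It uses the standard result that the maximizing coefficients are proportional to the inverse pooled within-sample covariance matrix times the difference of the sample means. The data are hypothetical, and the function names are illustrative rather than taken from the report.

```python
def mean(rows):
    """Column means of a list of observation tuples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def pooled_cov(g1, g2):
    """Pooled within-sample covariance matrix of two samples."""
    p = len(g1[0])
    s = [[0.0] * p for _ in range(p)]
    for rows in (g1, g2):
        m = mean(rows)
        for r in rows:
            for a in range(p):
                for b in range(p):
                    s[a][b] += (r[a] - m[a]) * (r[b] - m[b])
    dof = len(g1) + len(g2) - 2
    return [[v / dof for v in row] for row in s]


def solve2(s, d):
    """Solve the 2 x 2 system s w = d by Cramer's rule."""
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    return [(s[1][1] * d[0] - s[0][1] * d[1]) / det,
            (s[0][0] * d[1] - s[1][0] * d[0]) / det]


def ratio(w, d, s):
    """Squared difference of projected means over within-sample variance."""
    num = (w[0] * d[0] + w[1] * d[1]) ** 2
    den = sum(w[a] * s[a][b] * w[b] for a in range(2) for b in range(2))
    return num / den


# Hypothetical samples from two populations.
group1 = [(4.0, 2.0), (2.0, 4.0), (2.0, 3.0), (3.0, 6.0), (4.0, 4.0)]
group2 = [(9.0, 10.0), (6.0, 8.0), (9.0, 5.0), (8.0, 7.0), (10.0, 8.0)]

m1, m2 = mean(group1), mean(group2)
d = [m1[0] - m2[0], m1[1] - m2[1]]
S = pooled_cov(group1, group2)
w = solve2(S, d)  # discriminant coefficients, up to a constant factor
print(w, ratio(w, d, S))
```

Any other choice of coefficients gives a ratio no larger than the one obtained for w, which is the sense in which the linear function is optimal.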

The question of significance of a discriminant function is also
considered. This is discussed in terms of whether or not the parent
populations are identical, and hence whether or not a discriminant function
is illusory. Either Hotelling's T² statistic or the analysis of variance
suggested by Fisher may be used for this problem. The analysis of variance
is also useful in determining the probability of misclassification. The
probability of misclassification can be arbitrarily set by the investigator
with the introduction of a "doubtful region," that is, a region for which
judgment is withheld.

The problem of three or more groups is approached by the use of a set of
discriminant functions, one for classifying between each possible pair of
groups.

When the covariance matrix of the p characteristics measured is not
the same for both populations, the discriminant functions are no longer
linear functions. This problem of unequal covariance matrices is considered
briefly.


